
Reducibility and Statistical-Computational Gaps

from Secret Leakage

Matthew Brennan ∗ Guy Bresler †

June 30, 2020



Abstract
Inference problems with conjectured statistical-computational gaps are ubiquitous throughout mod-
ern statistics, computer science, statistical physics and discrete probability. While there has been success
evidencing these gaps from the failure of restricted classes of algorithms, progress towards a more tra-
ditional reduction-based approach to computational complexity in statistical inference has been limited.
These average-case problems are each tied to a different natural distribution, high-dimensional struc-
ture and conjecturally hard parameter regime, leaving reductions among them technically challenging.
Despite a flurry of recent success in developing such techniques, existing reductions have largely been
limited to inference problems with similar structure – primarily mapping among problems representable
as a sparse submatrix signal plus a noise matrix, which is similar to the common starting hardness as-
sumption of planted clique (PC).
The insight in this work is that a slight generalization of the planted clique conjecture – secret leak-
age planted clique (PCρ ), wherein a small amount of information about the hidden clique is revealed –
gives rise to a variety of new average-case reduction techniques, yielding a web of reductions relating
statistical problems with very different structure. Based on generalizations of the planted clique conjec-
ture to specific forms of PCρ , we deduce tight statistical-computational tradeoffs for a diverse range of
problems including robust sparse mean estimation, mixtures of sparse linear regressions, robust sparse
linear regression, tensor PCA, variants of dense k-block stochastic block models, negatively correlated
sparse PCA, semirandom planted dense subgraph, detection in hidden partition models and a universality
principle for learning sparse mixtures. This gives the first reduction-based evidence supporting a num-
ber of statistical-computational gaps observed in the literature [Li17, BDLS17, DKS17, CX16, HWX15,
BBH18, FLWY18, LSLC18, RM14, HSS15, WEAM19, ASW13, VAC17].
We introduce a number of new average-case reduction techniques that also reveal novel connections
to combinatorial designs based on the incidence geometry of F_r^t and to random matrix theory. In par-
ticular, we show a convergence result between Wishart and inverse Wishart matrices that may be of
independent interest. The specific hardness conjectures for PCρ implying our statistical-computational
gaps all are in correspondence with natural graph problems such as k-partite, bipartite and hypergraph
variants of PC. Hardness in a k-partite hypergraph variant of PC is the strongest of these conjectures
and sufficient to establish all of our computational lower bounds. We also give evidence for our PCρ
hardness conjectures from the failure of low-degree polynomials and statistical query algorithms. Our
work raises a number of open problems and suggests that previous technical obstacles to average-case
reductions may have arisen because planted clique is not the right starting point. An expanded set of
hardness assumptions, such as PCρ , may be a key first step towards a more complete theory of reductions
among statistical problems.


∗ Massachusetts Institute of Technology. Department of EECS. Email: brennanm@mit.edu.
† Massachusetts Institute of Technology. Department of EECS. Email: guy@mit.edu.

Contents

I Summary of Results 5

1 Introduction 5
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Desiderata for Average-Case Reductions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Planted Clique and Secret Leakage 10

3 Problems and Statistical-Computational Gaps 13


3.1 Robust Sparse Mean Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Dense Stochastic Block Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Testing Hidden Partition Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 Semirandom Planted Dense Subgraph and the Recovery Conjecture . . . . . . . . . . . . . 18
3.5 Negatively Correlated Sparse Principal Component Analysis . . . . . . . . . . . . . . . . . 19
3.6 Unsigned and Mixtures of Sparse Linear Regressions . . . . . . . . . . . . . . . . . . . . . 20
3.7 Robust Sparse Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.8 Tensor Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.9 Universality for Learning Sparse Mixtures . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4 Technical Overview 25
4.1 Rejection Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Dense Bernoulli Rotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Design Matrices and Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.4 Decomposing Linear Regression and Label Generation . . . . . . . . . . . . . . . . . . . . 29
4.5 Producing Negative Correlations and Inverse Wishart Matrices . . . . . . . . . . . . . . . . 30
4.6 Completing Tensors from Hypergraphs and Tensor PCA . . . . . . . . . . . . . . . . . . . 31
4.7 Symmetric 3-ary Rejection Kernels and Universality . . . . . . . . . . . . . . . . . . . . . 32
4.8 Encoding Cliques as Structural Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5 Further Directions and Open Problems 33

II Average-Case Reduction Techniques 35

6 Preliminaries and Problem Formulations 35


6.1 Conventions for Detection Problems and Adversaries . . . . . . . . . . . . . . . . . . . . . 35
6.2 Reductions in Total Variation and Computational Lower Bounds . . . . . . . . . . . . . . . 36
6.3 Problem Formulations as Detection Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

7 Rejection Kernels and Reduction Preprocessing 43


7.1 Gaussian Rejection Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
7.2 Cloning and Planting Diagonals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
7.3 Symmetric 3-ary Rejection Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

8 Dense Bernoulli Rotations 50
8.1 Mapping Planted Bits to Spiked Gaussian Tensors . . . . . . . . . . . . . . . . . . . . . . . 50
8.2 F_r^t Design Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
8.3 F_r^t Design Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
8.4 A Random Matrix Alternative to K_{r,t} . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

9 Negatively Correlated Sparse PCA 61


9.1 Reducing to Negative Sparse PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
9.2 Comparing Wishart and Inverse Wishart . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

10 Negative Correlations, Sparse Mixtures and Supervised Problems 69


10.1 Reduction to Imbalanced Sparse Gaussian Mixtures . . . . . . . . . . . . . . . . . . . . . . 69
10.2 Sparse Mixtures of Regressions and Negative Sparse PCA . . . . . . . . . . . . . . . . . . 76

11 Completing Tensors from Hypergraphs 82

III Computational Lower Bounds from PCρ 86

12 Secret Leakage and Hardness Assumptions 86


12.1 Hardness Assumptions and the PCρ Conjecture . . . . . . . . . . . . . . . . . . . . . . . . 86
12.2 Low-Degree Polynomials and the PCρ Conjecture . . . . . . . . . . . . . . . . . . . . . . . 91
12.3 Statistical Query Algorithms and the PCρ Conjecture . . . . . . . . . . . . . . . . . . . . . 96

13 Robustness, Negative Sparse PCA and Supervised Problems 99


13.1 Robust Sparse Mean Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
13.2 Negative Sparse PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
13.3 Mixtures of Sparse Linear Regressions and Robustness . . . . . . . . . . . . . . . . . . . . 104

14 Community Recovery and Partition Models 106


14.1 Dense Stochastic Block Models with Two Communities . . . . . . . . . . . . . . . . . . . . 106
14.2 Testing Hidden Partition Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
14.3 Semirandom Single Community Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

15 Tensor Principal Component Analysis 127

16 Universality of Lower Bounds for Learning Sparse Mixtures 132


16.1 Reduction to Generalized Learning Sparse Mixtures . . . . . . . . . . . . . . . . . . . . . . 132
16.2 The Universality Class UC(n) and Level of Signal τU . . . . . . . . . . . . . . . . . . . . . 137

17 Computational Lower Bounds for Recovery and Estimation 140


17.1 Our Reductions and Computational Lower Bounds for Recovery . . . . . . . . . . . . . . . 140
17.2 Relationship Between Detection and Recovery . . . . . . . . . . . . . . . . . . . . . . . . . 142

IV Appendix 161

A Deferred Proofs from Part II 161
A.1 Proofs of Total Variation Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
A.2 Proofs for To-k-Partite-Submatrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
A.3 Proofs for Symmetric 3-ary Rejection Kernels . . . . . . . . . . . . . . . . . . . . . . . . . 166
A.4 Proofs for Label Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

B Deferred Proofs from Part III 170


B.1 Proofs from Secret Leakage and the PCρ Conjecture . . . . . . . . . . . . . . . . . . . . . . 170
B.2 Proofs for Reductions and Computational Lower Bounds . . . . . . . . . . . . . . . . . . . 171

Part I
Summary of Results
1 Introduction
Computational complexity has become a central consideration in statistical inference as focus has shifted
to high-dimensional structured problems. A primary aim of the field of mathematical statistics is to deter-
mine how much data is needed for various estimation tasks, and to analyze the performance of practical
algorithms. For a century, the focus has been on information-theoretic limits. However, the study of high-
dimensional structured estimation problems over the last two decades has revealed that the much more
relevant quantity – the amount of data needed by computationally efficient algorithms – may be significantly
higher than what is achievable without computational constraints. These statistical-computational gaps were
first observed to exist more than two decades ago [Val84, Ser99, DGR00] but only recently have emerged
as a trend ubiquitous in problems throughout modern statistics, computer science, statistical physics and
discrete probability [BB08, CJ13, JM15]. Prominent examples arise in estimating sparse vectors from linear
observations, estimating low-rank tensors, community detection, subgraph and matrix recovery problems,
random constraint satisfiability, sparse principal component analysis and robust estimation.
Because statistical inference problems are formulated with probabilistic models on the observed data,
there are natural barriers to basing their computational complexity as average-case problems on worst-case
complexity assumptions such as P ≠ NP [FF93, BT06a, ABX08]. To cope with this complication, a number
of different approaches have emerged to provide evidence for conjectured statistical-computational gaps.
These can be roughly classified into two categories:

1. Failure of Classes of Algorithms: Showing that powerful classes of efficient algorithms, such as
statistical query algorithms, the sum of squares (SOS) hierarchy and low-degree polynomials, fail up
to the conjectured computational limit of the problem.

2. Average-Case Reductions: The traditional complexity-theoretic approach showing the existence of


polynomial-time reductions relating statistical-computational gaps in problems to one another.

The line of research providing evidence for statistical-computational gaps through the failure of powerful
classes of algorithms has seen a lot of progress in the past few years. A breakthrough work of [BHK+ 16]
developed the general technique of pseudocalibration for showing SOS lower bounds, and used this method
to prove tight lower bounds for planted clique (PC). In [Hop18], pseudocalibration motivated a general
conjecture on the optimality of low-degree polynomials for hypothesis testing that has been used to pro-
vide evidence for a number of additional gaps [HS17, KWB19, BKW19]. There have also been many
other recent SOS lower bounds [Gri01, DM15b, MW15a, MPW15, KMOW17, HKP+ 18, RSS18, HKP+ 17,
MRX19]. Other classes of algorithms for which there has been progress in a similar vein include statistical
query algorithms [FGR+ 13, FPV15, DKS17, DKS19], classes of circuits [RR97, Ros08, Ros14], local algo-
rithms [GS+ 17, Lin92] and message-passing algorithms [ZK16, LKZ15, LDBB+ 16, KMRT+ 07, RTSZ19,
BPW18]. Another line of work has aimed to provide evidence for computational limits by establishing
properties of the energy landscape of solutions that are barriers to natural optimization-based approaches
[ACO08, GZ17, BMMN17, BGJ18, RBBC19, CGP+ 19, GZ19].
While there has been success evidencing statistical-computational gaps from the failure of these classes
of algorithms, progress towards a traditional reduction-based approach to computational complexity in sta-
tistical inference has been more limited. This is because reductions between average-case problems are more
constrained and overall very different from reductions between worst-case problems. Average-case combi-
natorial problems have been studied in computer science since the 1970’s [Kar77, Kuč77]. In the 1980’s,

Levin introduced his theory of average-case complexity [Lev86], formalizing the notion of an average-case
reduction and obtaining abstract completeness results. Since then, average-case complexity has been studied
extensively in cryptography and complexity theory. A survey of this literature can be found in [BT+ 06b]
and [Gol11]. As discussed in [Bar17] and [Gol11], average-case reductions are notoriously delicate and
there is a lack of available techniques. Although technically difficult to obtain, average-case reductions have
a number of advantages over other approaches. Aside from the advantage of being future-proof against
new classes of algorithms, showing that a problem of interest is hard by reducing from PC effectively sub-
sumes hardness for classes of algorithms known to fail on PC and thus gives stronger evidence for hardness.
Reductions preserving gaps also directly relate phenomena across problems and reveal insights into how
parameters, hidden structures and noise models correspond to one another.
Worst-case reductions are only concerned with transforming the hidden structure in one problem to
another. For example, a worst-case reduction from 3-SAT to k-INDEPENDENT-SET needs to ensure that
the hidden structure of a satisfiable 3-SAT formula is mapped to a graph with an independent set of size
k, and that an unsatisfiable formula is not. Average-case reductions need to not only transform the struc-
ture in one problem to that of another, but also precisely map between the natural distributions associated
with problems. In the case of the example above, all classical worst-case reductions use gadgets that map
random 3-SAT formulas to a very unnatural distribution on graphs. Average-case problems in statistical in-
ference are also fundamentally parameterized, with parameter regimes in which the problem is information-
theoretically impossible, possible but conjecturally computationally hard and computationally easy. To
establish the strongest possible lower bounds, reductions need to exactly fill out one of these three parame-
ter regimes – the one in which the problem is conjectured to be computationally hard. These subtleties that
arise in devising average-case reductions will be discussed further in Section 1.2.
Despite these challenges, there has been a flurry of recent success in developing techniques for average-
case reductions among statistical problems. Since the seminal paper of [BR13a] showing that a statistical-
computational gap for a distributionally-robust formulation of sparse PCA follows from the PC conjecture,
there have been a number of average-case reductions among statistical problems. Reductions from PC have
been used to show lower bounds for RIP certification [WBP16, KZ14], biclustering detection and recov-
ery [MW15b, CLR15, CW18, BBH19], planted dense subgraph [HWX15, BBH19], testing k-wise indepen-
dence [AAK+ 07], matrix completion [Che15] and sparse PCA [BR13b, BR13a, WBS16, GMZ17, BB19b].
Several reduction techniques were introduced in [BBH18], providing the first web of average-case reduc-
tions among a number of problems involving sparsity. More detailed surveys of these prior average-case
reductions from PC can be found in the introduction section of [BBH18] and in [WX18]. There also have
been a number of average-case reductions in the literature starting with different assumptions than the PC
conjecture. Hardness conjectures for random CSPs have been used to show hardness in improper learning
complexity [DLSS14], learning DNFs [DSS16] and hardness of approximation [Fei02]. Recent reductions
also map from a 3-uniform hypergraph variant of the PC conjecture to SVD for random 3-tensors [ZX17]
and between learning two-layer neural networks and tensor decomposition [MM18b].
A common criticism to the reduction-based approach to computational complexity in statistical infer-
ence is that, while existing reductions have introduced nontrivial techniques for mapping precisely between
different natural distributions, they are not yet capable of transforming between problems with dissimilar
high-dimensional structures. In particular, the vast majority of the reductions referenced above map among
problems representable as a sparse submatrix signal plus a noise matrix, which is similar to the common
starting hardness assumption PC. Such a barrier would be fatal to a satisfying reduction-based theory of
statistical-computational gaps, as the zoo of statistical problems with gaps contains a broad range of very
different high-dimensional structures. This leads directly to the following central question that we aim to
address in this work.

Question 1.1. Can statistical-computational gaps in problems with different high-dimensional structures be

related to one another through average-case reductions?

[Diagram: the web of reductions. The PCρ conjecture and its instantiations (k-partite hypergraph PC, k-partite PC, bipartite PC and k-part bipartite PC) reduce to detection in hidden partition models, semirandom community recovery, negative sparse PCA, imbalanced and balanced sparse Gaussian mixtures, the imbalanced 2-SBM, planted subtensor, tensor PCA, robust SLR, unsigned SLR, mixtures of SLRs, robust sparse mean estimation and universality for learning sparse mixtures. Edge labels: (1) produce negative correlations with inverted Wishart; (2) dense Bernoulli rotations with K_{2,t}; (3) dense Bernoulli rotations with K_{3,t}; (4) dense Bernoulli rotations with K_{r,t}; (5) dense Bernoulli rotations with design tensors; (6) LR decomposition and label generation; (7) symmetric 3-ary rejection kernels; (8) multi-query reduction completing tensors from hypergraphs.]

Figure 1: The web of reductions carried out in this paper. An edge indicates existence of a reduction transferring com-
putational hardness from the tail to the head. Edges are labeled with associated reduction techniques and unlabelled
edges correspond to simple reductions or specializing a problem to a particular case.

1.1 Overview
The main objective of this paper is to provide the first evidence that relating differently structured statis-
tical problems through reductions is possible. We show that mild generalizations of the PC conjecture to
k-partite and bipartite variants of PC are naturally suited to a number of new average-case reduction tech-
niques. These techniques map to problems breaking out of the sparse submatrix plus noise structure that
seemed to constrain prior reductions. They thus show that revealing a tiny amount of information about the
hidden clique vertices substantially increases the reach of the reductions approach, providing the first web
of reductions among statistical problems with significantly different structure. Our techniques also yield
reductions beginning from hypergraph variants of PC which, along with the k-partite and bipartite variants
mentioned above, can be unified under a single assumption that we introduce – the secret leakage planted
clique (PCρ ) conjecture. This conjecture makes a precise prediction of what information about the hidden
clique can be revealed while PC remains hard.
A summary of our web of average-case reductions is shown in Figure 1. Our reductions yield tight
statistical-computational gaps for a range of differently structured problems, including robust sparse mean
estimation, variants of dense stochastic block models, detection in hidden partition models, semirandom
planted dense subgraph, negatively correlated sparse PCA, mixtures of sparse linear regressions, robust
sparse linear regression, tensor PCA and a universality principle for learning sparse mixtures. This gives
the first reduction-based evidence supporting a number of gaps observed in the literature [Li17, BDLS17,
DKS17, CX16, HWX15, BBH18, FLWY18, LSLC18, RM14, HSS15, WEAM19, ASW13, VAC17]. In par-
ticular, there are no known reductions deducing these gaps from the ordinary PC conjecture. Similar to
[BBH18], several average-case problems emerge as natural intermediates in our reductions, such as nega-

tive sparse PCA and imbalanced sparse Gaussian mixtures. The specific instantiations of the PCρ conjecture
needed to obtain these lower bounds correspond to natural k-partite, bipartite and hypergraph variants of PC.
Among these hardness assumptions, we show that hardness in a k-partite hypergraph variant of PC (k-HPC_s)
is the strongest and sufficient to establish all of our computational lower bounds. We also give evidence for
our hardness conjectures from the failure of low-degree polynomials and statistical query algorithms.
Our results suggest that PC may not be the right starting point for average-case reductions among sta-
tistical problems. However, surprisingly mild generalizations of PC are all that are needed to break beyond
the structural constraints of previous reductions. Generalizing to either PCρ or k-HPC_s unifies all of our re-
ductions under a single hardness assumption, now capturing reductions to a range of dissimilarly structured
problems including supervised learning tasks and problems over tensors. This suggests PCρ and k-HPC_s are
both much more powerful candidate starting points than PC and, more generally, that these may be a key
first step towards a more complete theory of reductions among statistical problems. Although we often will
focus on providing evidence for statistical-computational gaps, we emphasize that our main contribution
is more general – our reductions give a new set of techniques for relating differently structured statistical
problems that seem likely to have applications beyond the problems we consider here.
The rest of the paper is structured as follows. The next section gives general background on average-
case reductions and several criteria that they must meet in order to show strong computational lower bounds
for statistical problems. In Section 2, we introduce the PCρ conjecture and the specific instantiations of this
conjecture that imply our computational lower bounds, such as k-HPC_s. In Section 3 we formally introduce
the problems in Figure 1 and state our main theorems. In Section 4, we describe the key ideas underlying
our techniques and we conclude Part I by discussing a number of questions arising from these techniques
in Section 5. Parts II and III are devoted to formally introducing our reduction techniques and applying
them, respectively. Part II begins with Section 6, which introduces reductions in total variation and the
corresponding hypothesis testing formulation for each problem we consider that it will suffice to reduce to.
In the rest of Part II, we introduce our main reduction techniques and give several initial applications of these
techniques to reduce to a subset of the problems that we consider. Part III begins with a further discussion
of the PCρ conjecture, where we show that k-HPC_s is our strongest assumption and provide evidence for the
PCρ conjecture from the failure of low-degree tests and the statistical query model. The remainder of Part
III is devoted to our other reductions and deducing the computational lower bounds in our main theorems
from Section 3. At the end of Part III, we discuss the implications of our reductions to estimation and
recovery formulations of the problems that we consider. Reading Part I, Section 12 and the pseudocode for
our reductions gives an accurate summary of the theorems and ideas in this work. We note that a preliminary
draft of this work containing a small subset of our results appeared in [BB19a].

1.2 Desiderata for Average-Case Reductions


As discussed in the previous section, average-case reductions are delicate and more constrained than their
worst-case counterparts. In designing average-case reductions between problems in statistical inference, the
essential challenge is to reduce to instances that are hard up to the conjectured computational barrier, with-
out destroying the naturalness of the distribution over instances. Dissecting this objective further yields four
general criteria for a reduction between the problems P and P′ to be deemed to show strong computational
lower bounds for P′. These objectives are to varying degrees at odds with one another, which is what makes
devising reductions a challenging task. To illustrate these concepts, our running example will be our reduc-
tion from PCρ to robust sparse linear regression (SLR). Some parts of this discussion are slightly simplified
for clarity. The following are our four criteria.

1. Aesthetics: If P and P′ each have a specific canonical distribution then a reduction must faithfully
map these distributions to one another. In our example, this corresponds to mapping the independent
0-1 edge indicators in a random graph to noisy Gaussian samples of the form y = ⟨β, X⟩ + N(0, 1)
with X ∼ N(0, I_d) and where an ε-fraction are corrupted (see the sampling sketch after this list).
2. Mapping Between Different Structures: A reduction must simultaneously map all possible latent
signals of P to that of P′. In our example, this corresponds to mapping each possible clique position
in PCρ to a specific mixture over the hidden vector β. A reduction in this case would also need to
map between possibly very differently structured data, e.g., in robust SLR the dependence of (X, y)
on β is intricate and the ε-fraction of corrupted samples also produces latent structure across samples.
These are both very different than the planted signal plus noise form of the clique in PCρ .
3. Tightness to Algorithms: A reduction showing computational lower bounds that are tight against
what efficient algorithms can achieve needs to map the conjectured computational limits of P to those
of P′. In our example, PCρ in general has a conjectured limit depending on ρ, which may for instance
be at K = o(√N) when the clique is of size K in a graph with N vertices. In contrast, robust SLR
has the conjectured limit at n = õ(k²ε²/τ⁴), where τ is the ℓ2 error to which we wish to estimate β,
k is the sparsity of β and n is the number of samples.
4. Strong Lower Bounds for Parameterized Problems: In order to show that a certain constraint C de-
fines the computational limit of P′ through this reduction, we need the reduction to fill out the possible
parameter sequences within C. For example, to show that n = õ(k²ε²/τ⁴) truly captures the correct
dependence in our computational lower bound for robust SLR, it does not suffice to produce a single
sequence of points (n, k, d, τ, ε) for which this is true, or even a one parameter curve. There are four
parameters in the conjectured limit and a reduction showing that this is the correct dependence needs
to fill out any possible combination of growth rates in these parameters permitted by n = õ(k²ε²/τ⁴).
The fact that the initial problem P has a conjectured limit depending on only two parameters can
make achieving this criterion challenging.
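To make the first criterion concrete, the sketch below samples directly from the canonical robust SLR distribution referred to in item 1, with a k-sparse β and an ε-fraction of samples replaced by an arbitrary (here, fixed and simple) corruption. It samples the target distribution directly rather than via a reduction from PCρ, and the particular corruption and all names are illustrative assumptions of ours.

import numpy as np

def sample_robust_slr(n, d, k, eps, rng):
    # k-sparse regression vector beta with entries 1/sqrt(k) on a random support
    beta = np.zeros(d)
    beta[rng.choice(d, size=k, replace=False)] = 1 / np.sqrt(k)
    X = rng.standard_normal((n, d))         # X_i ~ N(0, I_d)
    y = X @ beta + rng.standard_normal(n)   # y_i = <beta, X_i> + N(0, 1)
    # corrupt an eps-fraction of the samples arbitrarily (here: shifted outliers)
    bad = rng.choice(n, size=int(eps * n), replace=False)
    X[bad] = rng.standard_normal((len(bad), d)) + 5.0
    y[bad] = 10.0
    return X, y, beta, bad

rng = np.random.default_rng(0)
X, y, beta, bad = sample_robust_slr(n=2000, d=500, k=10, eps=0.05, rng=rng)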
We remark that the third criterion requires that reductions are information preserving in the sense that they
do not degrade the underlying level of signal used by optimal efficient algorithms. This necessitates that the
amount of additional randomness introduced in reductions to achieve aesthetic requirements is negligible.
The fourth criterion arises from the fact that statistical problems are generally described by a tuple of param-
eters and are therefore actually an entire family of problems. A full characterization of the computational
feasibility of a problem therefore requires addressing all possible scalings of the parameters.
All of the reductions carried out in this paper satisfy all four desiderata. Several of the initial reductions
from PC in the literature met most but not all of these criteria. For example, the reductions in [BR13a,
WBS16] to sparse PCA map to a distribution in a distributionally robust formulation of the problem as
opposed to the canonical Gaussian formulation in the spiked covariance model. Similarly [CLR15] reduces
to a distributionally robust formulation of submatrix localization. The reduction in [GMZ17] only shows
tight computational lower bounds for sparse PCA at a particular point in the parameter space when θ = Θ̃(1)
and n = Θ̃(k²). However, a number of reductions in the literature have successfully met all of these four
criteria [MW15b, HWX15, ZX17, BBH18, BB19b, BBH19].
We remark that it can be much easier to only satisfy some of these desiderata – in particular, many
natural reduction ideas meet a subset of these four criteria but fail to show nontrivial computational lower
bounds. For instance, it is often straightforward to construct a reduction that degrades the level of signal.
The simple reduction that begins with PC and randomly subsamples edges with probability n−α yields an
instance of planted dense subgraph with the correct distributional aesthetics. However, this reduction fails to
be tight to algorithms and furthermore fails to show any meaningful tradeoff between the size of the planted
dense subgraph and the sparsity of the graph.
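The naive subsampling map just described is simple enough to write out in full; the following is a minimal sketch (our own illustration, assuming the PC instance is given as a dense 0-1 adjacency matrix), included only to make the discussion above concrete.

import numpy as np

def subsample_reduction(A, alpha, rng):
    # keep each edge of the PC(n, k, 1/2) instance independently with
    # probability n^(-alpha); edges on the planted clique then have density
    # n^(-alpha) while the remaining edges have density n^(-alpha) / 2
    n = A.shape[0]
    keep = np.triu(rng.random((n, n)) < n ** (-alpha), 1)
    B = np.triu(A, 1) * keep
    return B + B.T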
Another natural reduction to robust sparse mean estimation first maps from PC to Gaussian biclustering
using one of the reductions in [MW15b, BBH18, BBH19], computes the sum v of all of the rows of this

matrix, then uses Gaussian cloning as in [BBH18] to produce n weak copies of v and finally outputs these
copies with an ε-fraction corrupted. This reduction can be verified to produce a valid instance of robust
sparse mean estimation in its canonical Gaussian formulation, but fails to show any nontrivial hardness
above its information-theoretic limit. Conceptually, this is because the reduction is generating the ε-fraction
of the corruptions itself. On applying a robust sparse mean estimation blackbox to solve PC, the reduction
could just as easily have revealed which samples it corrupted. This would allow the blackbox to only have to
solve sparse mean estimation, which has no statistical-computational gap. In general, a reduction showing
tight computational lower bounds cannot generate a non-negligible amount of the randomness that produces
the hardness of the target problem. Instead, this ε-fraction must come from the hidden clique in the input PC
instance. In Section 4.8, we discuss how our reductions obliviously encode cliques into the hidden structures
in the problems we consider.
We also remark that many problems that appear to be similar from the perspective of designing efficient
algorithms can be quite different to reduce to. This arises from differences in their underlying stochastic
models that efficient algorithms do not have to make use of. For example, although ordinary sparse PCA
and sparse PCA with a negative spike can be solved by the same efficient algorithms, the former has a signal
plus noise decomposition while the latter does not and has negatively correlated as opposed to positively
correlated planted entries. We will see that these subtle differences are significant in designing reductions.

2 Planted Clique and Secret Leakage


In this section, we introduce planted clique and our generalization of the planted clique conjecture. In the
planted clique problem (PC), the task is to find the vertex set of a k-clique planted uniformly at random
in an n-vertex Erdős-Rényi graph G. Planted clique can equivalently be formulated as a testing problem
PC (n, k, 1/2) [AAK+ 07] between the two hypotheses

H0 : G ∼ G(n, 1/2) and H1 : G ∼ G(n, k, 1/2)

where G(n, 1/2) denotes the n-vertex Erdős-Rényi graph with edge density 1/2 and G(n, k, 1/2) the distri-
bution resulting from planting a k-clique uniformly at random in G(n, 1/2). This problem can be solved in
quasipolynomial time by searching through all vertex subsets of size (2 + ε) log2 n if k > (2 + ε) log2 n. The
Planted Clique Conjecture is that there is no polynomial time algorithm solving PC(n, k, 1/2) if k = o(√n).
There is a plethora of evidence in the literature for the PC conjecture. Spectral algorithms, approximate
message passing, semidefinite programming, nuclear norm minimization and several other polynomial-time
combinatorial approaches all appear to fail to solve PC exactly when k = o(√n) [AKS98, FK00, McS01,
FR10, AV11, DGGP14, DM15a, CX16]. Lower bounds against low-degree sum of squares relaxations
[BHK+ 16] and statistical query algorithms [FGR+ 13] have also been shown up to k = o(√n).

Secret Leakage PC. We consider a slight generalization of the planted clique problem, where the input
graph G comes with some information about the vertex set of the planted clique. This corresponds to
the vertices in the k-clique being chosen from some distribution ρ other than the uniform distribution on
k-subsets of [n], as formalized in the following definition.
Definition 2.1 (Secret Leakage PCρ ). Given a distribution ρ on k-subsets of [n], let Gρ (n, k, 1/2) be the
distribution on n-vertex graphs sampled by first sampling G ∼ G(n, 1/2) and S ∼ ρ independently and
then planting a k-clique on the vertex set S in G. Let PCρ (n, k, 1/2) denote the resulting hypothesis testing
problem between H0 : G ∼ G(n, 1/2) and H1 : G ∼ Gρ (n, k, 1/2).
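For concreteness, the following minimal sketch samples from G(n, 1/2) and from Gρ(n, k, 1/2) given any sampler for ρ; the uniform ρ shown recovers ordinary PC. The function names are our own and the sketch is purely illustrative, not part of the reductions in this paper.

import numpy as np

def sample_er(n, rng):
    # G(n, 1/2): each edge included independently with probability 1/2
    A = np.triu(rng.integers(0, 2, size=(n, n)), 1)
    return A + A.T

def sample_pc_rho(n, k, sample_rho, rng):
    # G_rho(n, k, 1/2): sample G ~ G(n, 1/2) and S ~ rho independently,
    # then plant a k-clique on the vertex set S
    A = sample_er(n, rng)
    S = np.asarray(sample_rho(rng))
    A[np.ix_(S, S)] = 1
    np.fill_diagonal(A, 0)
    return A, S

def uniform_rho(n, k):
    # ordinary PC: rho is uniform over all k-subsets of [n]
    return lambda rng: rng.choice(n, size=k, replace=False)

rng = np.random.default_rng(0)
A, S = sample_pc_rho(n=100, k=10, sample_rho=uniform_rho(100, 10), rng=rng)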
All of the ρ that we will consider will be uniform over the k-subsets that satisfy some constraint. In
the cryptography literature, modifying a problem such as PC with a promise of this form is referred to

as information leakage about the secret. There is a large body of work on leakage-resilient cryptography
recently surveyed in [KR19]. The hardness of the Learning with Errors (LWE) problem has been shown to
be unconditionally robust to leakage [DGK+ 10, GKPV10], and it is left as an interesting open problem to
show that a similar statement holds true for PC.
Both PC and PCρ fall under the class of general parameter recovery problems where the task is to find
PS generating the observed graph from a family of distributions {PS }. In the case of PC, PS denotes
the distribution G(n, k, 1/2) conditioned on the k-clique being planted on S. Observe that the conditional
distributions {PS } are the same in PC and PCρ . Secret leakage can be viewed as placing a prior on the
parameter S of interest, rather than changing the main average-case part of the problem – the family {PS }.
When ρ is uniform over a family of k-subsets, secret leakage corresponds to imposing a worst-case constraint
on S. In particular, consider the maximum likelihood estimator (MLE) for a general parameter recovery
problem given by
Ŝ = arg max_{S∈supp(ρ)} PS(G)

As ρ varies, only the search space of the MLE changes while the objective remains the same. We make the
following precise conjecture of the hardness of PCρ (n, k, 1/2) for the distributions ρ we consider. Given
a distribution ρ, let pρ(s) = P_{S,S′∼ρ⊗2}[|S ∩ S′| = s] be the probability mass function of the size of the
intersection of two independent random sets S and S′ drawn from ρ.

Conjecture 2.2 (Secret Leakage Planted Clique (PCρ ) Conjecture). Let ρ be one of the distributions on k-
subsets of [n] given below in Conjecture 2.3. Suppose that there is some p0 = on (1) and constant δ > 0
such that pρ (s) satisfies the tail bounds
pρ(s) ≤ p0 · 2^{−s²}     if 1 ≤ s² < d,
pρ(s) ≤ p0 · s^{−2d−4}   if s² ≥ d,

for any parameter d = On((log n)^{1+δ}). Then there is no polynomial time algorithm solving PCρ(n, k, 1/2).
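To make the overlap distribution pρ concrete, the short sketch below evaluates pρ(s) exactly for two simple choices of ρ: the uniform distribution over k-subsets (where the overlap of two independent draws is hypergeometric) and the k-partite ρ appearing in Conjecture 2.3 below, for which each of the k parts independently contributes an intersection with probability k/n, so that the overlap is Binomial(k, k/n). This is only meant to illustrate the decay of pρ(s); it plays no role in our proofs.

from math import comb

def p_uniform(n, k, s):
    # overlap of two independent uniform k-subsets of [n] is hypergeometric
    return comb(k, s) * comb(n - k, k - s) / comb(n, k)

def p_kpartite(n, k, s):
    # k-partite rho: one vertex per part of size n/k, so each part matches
    # independently with probability k/n and the overlap is Binomial(k, k/n)
    p = k / n
    return comb(k, s) * p ** s * (1 - p) ** (k - s)

n, k = 10_000, 100   # k of order sqrt(n)
for s in range(1, 6):
    print(s, p_uniform(n, k, s), p_kpartite(n, k, s))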

While this conjecture is only stated for the specific ρ corresponding to the hardness assumptions used in
our reductions, we believe it should hold for a wide class of ρ with sufficient symmetry. The motivation for
the decay condition on pρ in the PCρ conjecture is from low-degree polynomials, which we show in Section
12.2 fail to solve PCρ subject to this condition. The low-degree conjecture – that low-degree polynomials
predict the computational barriers for a broad class of inference problems – has been shown to match conjec-
tured statistical-computational gaps in a number of problems [HS17, Hop18, KWB19, BKW19]. We discuss
this conjecture, the technical conditions arising in its formalizations and how these relate to PCρ in Section
12.2. Specifically, we discuss the importance of symmetry and the requirement on d in generalizing Conjec-
ture 2.2 to further ρ. In contrast to low-degree polynomials, because the SQ model only concerns problems
with a notion of samples, it seems ill-suited to accurately predict the computational barriers in PCρ for every
ρ. However, in Section 12.3, we show SQ lower bounds supporting the PCρ conjecture for specific ρ related
to our hardness assumptions. We also remark that the distribution pρ is an overlap distribution, which has
been linked to conjectured statistical-computational gaps using techniques from statistical physics [ZK16].

Hardness Conjectures for Specific ρ. In our reductions, we will only need the PCρ conjecture for specific
ρ, all of which are simple and correspond to their own hardness conjectures in natural mild variants of PC.
Secret leakage can be viewed as a way to conceptually unify these different assumptions. These ρ all seem
to avoid revealing enough information about S to give rise to new polynomial time algorithms to solve PCρ .
In particular, spectral algorithms consistently seem to match our conjectured computational limits for PCρ
for the different ρ we consider.

We now introduce these specific hardness assumptions and briefly outline how each can be produced
from an instance of PCρ . This is more formally discussed in Section 12.1. Let GB (m, n, 1/2) denote the
distribution on bipartite graphs G with parts of size m and n wherein each edge between the two parts is
included independently with probability 1/2.

• k-partite PC: Suppose that k divides n and let E be a partition of [n] into k parts of size n/k. Let
k-PC_E(n, k, 1/2) be PCρ(n, k, 1/2) where ρ is uniformly distributed over all k-sets intersecting each
part of E in exactly one element (a sampling sketch is given after this list).

• bipartite PC: Let BPC(m, n, k_m, k_n, 1/2) be the problem of testing between H0 : G ∼ GB(m, n, 1/2)
and H1 under which G is formed by planting a complete bipartite graph with k_m and k_n vertices in
the two parts, respectively, in a graph sampled from GB(m, n, 1/2). This problem can be realized as
a bipartite subgraph of an instance of PCρ.

• k-part bipartite PC: Suppose that k_n divides n and let E be a partition of [n] into k_n parts of size
n/k_n. Let k-BPC_E(m, n, k_m, k_n, 1/2) be BPC where the k_n vertices in the part of size n are uniform
over all k_n-sets intersecting each part of E in exactly one element, as in the definition of k-PC_E.
As with BPC, this problem can be realized as a bipartite subgraph of an instance of PCρ, now with
additional constraints on ρ to enforce the k-part restriction.

• k-partite hypergraph PC: Let k, n and E be as in the definition of k- PC. Let k- HPCsE (n, k, 1/2)
where s ≥ 3 be the problem of testing between H0 , under which G is an s-uniform Erdős-Rényi
hypergraph where each hyperedge is included independently with probability 1/2, and H1 , under
which G is first sampled from H0 and then a k-clique with one vertex chosen uniformly at random
from each part of E is planted in G. This problem has a simple correspondence with PCρ : there is a
specific ρ that corresponds to unfolding the adjacency tensor of this hypergraph problem into a matrix.
We will show more formally how to produce k- HPCsE (n, k, 1/2) from PCρ in Section 12.1.
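The sketch below samples the planted structures for two of these variants: the k-partite clique support (one vertex chosen from each part of E, taken to be consecutive blocks) and, for s = 3, an instance of the hypergraph variant represented as a dense symmetric 0/1 tensor. The representation and parameter choices are our own and only for illustration; in particular, the dense tensor is used purely for clarity and only small n is feasible this way.

import numpy as np
from itertools import combinations, permutations

def sample_kpartite_support(n, k, rng):
    # E = consecutive blocks of size n/k; rho picks one vertex per part
    m = n // k
    return np.array([i * m + rng.integers(m) for i in range(k)])

def sample_k_hpc3(n, k, rng, planted=True):
    # 3-uniform Erdos-Renyi hypergraph with hyperedge density 1/2, stored as a
    # symmetric tensor; under H1, a clique on a k-partite support S is planted
    T = np.zeros((n, n, n), dtype=np.int8)
    for i, j, l in combinations(range(n), 3):
        e = rng.integers(0, 2)
        for a, b, c in permutations((i, j, l)):
            T[a, b, c] = e
    S = None
    if planted:
        S = sample_kpartite_support(n, k, rng)
        for a, b, c in permutations(S, 3):
            T[a, b, c] = 1
    return T, S

rng = np.random.default_rng(0)
T, S = sample_k_hpc3(n=30, k=5, rng=rng)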

Since E is always revealed in these problems, it can without loss of generality be taken to be any partition
of [n] into k equally-sized parts. Consequently, we will often simplify notation by dropping the subscript E
from the above notation. We conjecture the following computational barriers for these graph problems, each
of which matches the decay rate condition on pρ(s) in the PCρ conjecture, as we will show in Section 12.1.

Conjecture 2.3 (Specific Hardness Assumptions). Suppose that m and n are polynomial in one another.
Then there is no poly(n) time algorithm solving the following problems:

1. k-PC(n, k, 1/2) when k = o(√n);

2. BPC(m, n, k_m, k_n, 1/2) when k_n = o(√n) and k_m = o(√m);

3. k-BPC(m, n, k_m, k_n, 1/2) when k_n = o(√n) and k_m = o(√m); and

4. k-HPC_s(n, k, 1/2) for s ≥ 3 when k = o(√n).

From an entropy viewpoint, the k-partite assumption common to these variants of PCρ only reveals a very
small amount of information about the location of the clique. In particular, both the uniform distribution
over k-subsets and over k-subsets respecting a given partition E have (1 + o(1))k log2 n bits of entropy.
We also remark that the PCρ conjecture, as stated, implies the thresholds in the conjecture above up to
arbitrarily small polynomial factors, i.e. where the thresholds are k = O(n^{1/2−ε}), k_n = O(n^{1/2−ε}) and
k_m = O(m^{1/2−ε}) for arbitrarily small ε > 0. As we will discuss in Section 12.2, the low-degree conjecture also
supports the stronger thresholds in Conjecture 2.3. We also note that our reductions continue to show tight
hardness up to arbitrarily small polynomial factors even under these weaker assumptions. As mentioned in

Section 1.1, our hardness assumption for k-HPC_s is the strongest of those in Conjecture 2.3. Specifically, in
Section 12.1 we give simple reductions showing that (4) in Conjecture 2.3 implies (1), (2) and (3).
We remark that the discussion in this section also applies to planted dense subgraph (PDS) problems. In
the PDS variant of a PC problem, instead of planting a k-clique in G(n, 1/2), a dense subgraph G(k, p) is
planted in G(n, q) where p > q. We conjecture that all of the hardness assumptions remain true for PDS with
constant edge densities 0 < q < p ≤ 1. Note that PC is an instance of PDS with p = 1 and q = 1/2. All
of the reductions beginning with PCρ in this work will also yield reductions beginning from secret leakage
planted dense subgraph problems PDSρ . In particular, they will continue to apply with a small loss in the
amount of signal when q = 1/2 and p = 1/2 + n^{−ε} for a small constant ε > 0. As discussed in [BB19b],
PDS conjecturally has no quasipolynomial time algorithms in this regime and thus our reductions would
transfer lower bounds above polynomial time. In this parameter regime, the barriers of PDS also appear
to be similar to those of detection in the sparsely spiked Wigner model, which also conjecturally has no
quasipolynomial time algorithms [HKP+ 17]. Throughout this work, we will denote the PDS variants of the
problems introduced above by k- PDS(n, k, p, q), BPDS(m, n, km , kn , p, q), k- BPDS(m, n, km , kn , p, q) and
k- HPDSs (n, k, p, q).

3 Problems and Statistical-Computational Gaps


In this section, we introduce the problems we consider and give informal statements of our main theorems,
each of which is a tight computational lower bound implied by a conjecture in the previous section. These
statistical-computational gaps follow from a variety of different average-case reduction techniques that are
outlined in the next section and will be the focus in the rest of this work. Before stating our main results, we
clarify precisely what we mean by solving and showing a computational lower bound for a problem. All of
the computational lower bounds in this section are implied by one of the assumptions in Conjecture 2.3. As
mentioned previously, they also follow from PDS variants of these assumptions or only from the hardness of
k-HPC_s, which is the strongest assumption.

Statistical Problems and Algorithms. Every problem P(n, a1 , a2 , . . . , at ) we consider is parameterized


by a natural parameter n and has several other parameters a1 (n), a2 (n), . . . , at (n), which will typically be
implicit functions of n. If P is a hypothesis testing problem with observation X and hypotheses H0 and H1 ,
an algorithm A is deemed to solve P subject to the constraints C if it has asymptotic Type I+II error bounded
away from 1 when (n, a1 , a2 , . . . , at ) ∈ C i.e. if PH0 [A(X) = H1 ] + PH1 [A(X) = H0 ] = 1 − Ωn (1).
Furthermore, we say that there is no algorithm solving P in polynomial time under the constraints C if for
any sequence of parameters {(n, a1, a2, . . . , at)}_{n=1}^∞ ⊆ C, there is no polynomial time algorithm solving
P(n, a1, a2, . . . , at) with Type I+II error bounded away from 1 as n → ∞. If P is an estimation problem
with a parameter θ of interest and loss ℓ, then A solves P subject to the constraints C if ℓ(A(X), θ) ≤ ε is
true with probability 1 − on(1) when (n, a1, a2, . . . , at, ε) ∈ C, where ε = ε(n) is a function of n.
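As an illustration of this notion of solving a detection problem, the self-contained sketch below estimates the Type I+II error of the simple total-edge-count test for PC(n, k, 1/2) by simulation. The test, threshold and parameters are our own choices, with k taken well above √n so that this naive test succeeds; in the conjecturally hard regime k = o(√n) it of course fails.

import numpy as np

def sample_pc_instance(n, k, planted, rng):
    A = np.triu(rng.integers(0, 2, size=(n, n)), 1)
    A = A + A.T
    if planted:
        S = rng.choice(n, size=k, replace=False)
        A[np.ix_(S, S)] = 1
        np.fill_diagonal(A, 0)
    return A

def edge_count_test(A, n, k):
    # declare H1 when the edge count exceeds its H0 mean by half the planted excess
    return np.triu(A, 1).sum() > n * (n - 1) / 4 + k * (k - 1) / 8

def type_one_plus_two(n, k, trials, rng):
    t1 = np.mean([edge_count_test(sample_pc_instance(n, k, False, rng), n, k)
                  for _ in range(trials)])
    t2 = np.mean([not edge_count_test(sample_pc_instance(n, k, True, rng), n, k)
                  for _ in range(trials)])
    return t1 + t2

rng = np.random.default_rng(0)
print(type_one_plus_two(n=400, k=60, trials=50, rng=rng))  # bounded away from 1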

Computational Lower Bounds. We say there is a computational lower bound for P subject to the con-
straint C if for any sequence of parameters {(n, a_1(n), a_2(n), . . . , a_t(n))}_{n=1}^∞ ⊆ C there is another sequence
given by {(n_i, a′_1(n_i), a′_2(n_i), . . . , a′_t(n_i))}_{i=1}^∞ ⊆ C such that P(n_i, a′_1(n_i), a′_2(n_i), . . . , a′_t(n_i)) cannot be
solved in poly(n_i) time and lim_{i→∞} log a′_k(n_i)/log a_k(n_i) = 1. In other words, there is a lower bound at C
if, for any sequence s in C, there is another sequence of parameters that cannot be solved in polynomial time
and whose growth matches the growth of a subsequence of s. Thus all of our computational lower bounds
are strong lower bounds in the sense that rather than show that a single sequence of parameters is hard, we
show that parameter sequences filling out all possible growth rates in C are hard.

The constraints C will typically take the form of a system of asymptotic inequalities. Furthermore, each
of our computational lower bounds for estimation problems will be established through a reduction to a
hypothesis testing problem which then implies the desired lower bound. The exact formulations for these
intermediate hypothesis testing problems can be found in Section 6.3 and how they also imply lower bounds
for estimation and recovery variants of our problems is discussed in Section 17. Throughout this work, we
will use the terms detection and hypothesis testing interchangeably. We say that two parameters a and b are
polynomial in one another if there is a constant C > 0 such that a^{1/C} ≤ b ≤ a^C as a → ∞. Throughout the
paper, we adopt the standard asymptotic notation O(·), Ω(·), o(·), ω(·) and Θ(·). We let Õ(·) and analogous
variants denote these relations up to polylog(n) factors. Here, n is the natural parameter of the problem
under consideration and will typically be clear from context. We remark that the argument of Õ(·) will
often be polynomially large or small in n, in which case our notation recovers the typical definition of Õ(·).
Furthermore, all of these definitions also apply to the discussion in the previous section.

Canonical Simplest Average-Case Formulations. All of our reductions are to the canonical simplest
average-case formulations of the problems we consider. For example, all k-sparse unit vectors in our lower
bounds are binary and in {0, 1/√k}^d, and the rank-1 component in our lower bound for tensor PCA is
sampled from a Rademacher prior. Our reductions are all also to the canonical simple vs. simple hypothesis
testing formulation for each of our problems and, as discussed in [BBH18], this yields strong computational
lower bounds, is often technically more difficult and crucially allows reductions to naturally be composed
with one another.

3.1 Robust Sparse Mean Estimation


The study of robust estimation began with Huber’s contamination model [Hub92, Hub65] and observations
of Tukey [Tuk75]. Classical robust estimators have typically either been computationally intractable or
heuristic [Hub11, Tuk75, Yat85]. Recent breakthrough works [DKK+ 16, LRV16] gave the first efficient al-
gorithms for high-dimensional robust estimation, which sparked an active line of research into robust algo-
rithms for other high-dimensional problems [ABL14, Li17, BDLS17, CSV17, DKK+ 18, KKM18, DKS19,
HL19, DHL19]. The most canonical high-dimensional robust estimation problem is robust sparse mean
estimation, which has an intriguing statistical-computational gap induced by robustness.
In sparse mean estimation, the observations X1, X2, . . . , Xn are n independent samples from N(µ, I_d)
where µ is an unknown k-sparse vector in R^d of bounded ℓ2 norm and the task is to estimate µ within an
ℓ2 error of τ. This is a gapless problem, as taking the largest k coordinates of the empirical mean runs in
poly(d) time and achieves the information-theoretically optimal sample complexity of n = Θ(k log d/τ²).
If an ε-fraction of these samples are corrupted arbitrarily by an adversary, this yields the robust sparse
mean estimation problem RSME(n, k, d, τ, ε). As discussed in [Li17, BDLS17], for ‖µ − µ′‖2 sufficiently
small, it holds that dTV(N(µ, I_d), N(µ′, I_d)) = Θ(‖µ − µ′‖2). Furthermore, an ε-corrupted set of samples
can simulate distributions within O(ε) total variation from N(µ, I_d). Therefore ε-corruption can simulate
N(µ′, I_d) if ‖µ′ − µ‖2 = O(ε) and it is impossible to estimate µ with ℓ2 distance less than this O(ε).
This implies that the minimax rate of estimation for µ is O(ε), even for very large n. As shown in [Li17,
BDLS17], the information-theoretic threshold for estimating at this rate in the ε-corrupted model remains at
n = Θ(k log d/ε²) samples. However, the best known polynomial-time algorithms from [Li17, BDLS17]
require n = Θ̃(k² log d/ε²) samples to estimate µ within τ = Θ(ε√(log ε⁻¹)) in ℓ2. In Sections 10.1 and
13.1, we give a reduction showing that these polynomial time algorithms are optimal, yielding the first
average-case evidence for the k-to-k² statistical-computational gap conjectured in [Li17, BDLS17]. Our
reduction applies to more general rates τ and obtains the following tradeoff.


Theorem 3.1 (Lower Bounds for RSME). If k, d and n are polynomial in each other, k = o(√d) and ε < 1/2
is such that (n, ε⁻¹) satisfies condition (T), then the k-BPC conjecture implies that there is a computational
lower bound for RSME(n, k, d, τ, ε) at all sample complexities n = õ(k²ε²/τ⁴).
For example, taking ε = 1/3 and τ = Õ(1) shows that there is a k-to-k² gap between the information-
theoretically optimal sample complexity of n = Θ̃(k) and the computational lower bound of n = õ(k²).
Note that taking τ = O(ε) in Theorem 3.1 recovers exactly the tradeoff in [Li17, BDLS17], with the depen-
dence on ε. Our reduction to RSME is based on dense Bernoulli rotations and constructions of combinatorial
design matrices based on incidence geometry in F_r^t, as is further discussed in Sections 4 and 8.
In Theorem 3.1, (T) denotes a technical condition arising from number-theoretic constraints in our
reduction that require that ε⁻¹ = n^{o(1)} or ε = Θ̃(n^{−1/2t}) for some positive integer t. As ε⁻¹ = n^{o(1)}
is the primary regime of interest in the RSME literature, this condition is typically trivial. We discuss the
condition (T) in more detail in Section 13 and give an alternate reduction removing it from Theorem 3.1 in
the case where ε = Θ̃(n^{−c}) for some constant c ∈ [0, 1/2].
Our result also holds in the stronger Huber’s contamination model where an ε-fraction of the n samples
are chosen at random and replaced with i.i.d. samples from another distribution D. The prior work of
[DKS17] showed that SQ algorithms require n = Ω̃(k²) samples to solve RSME, establishing the conjectured
k-to-k² gap in the SQ model. However, our work is the first to make a precise prediction of the computational
barrier in RSME as a function of both ε and τ. As will be discussed in Section 10.1, our reduction from k-PC
maps to the instance of RSME under the adversary introduced in [DKS17].
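To make the RSME setup concrete, the sketch below generates ε-corrupted samples in Huber's contamination model and applies the naive estimator mentioned above, which keeps the largest k coordinates of the empirical mean; this estimator is sample-optimal without corruptions but is not robust. The contamination distribution D and all parameter choices are illustrative assumptions of ours.

import numpy as np

def sample_rsme(n, d, k, eps, rng):
    # mu is k-sparse with entries 1/sqrt(k); an eps-fraction of samples are
    # drawn from a contamination distribution D instead (Huber's model)
    mu = np.zeros(d)
    mu[rng.choice(d, size=k, replace=False)] = 1 / np.sqrt(k)
    X = mu + rng.standard_normal((n, d))
    bad = rng.random(n) < eps
    X[bad] = 3.0 + rng.standard_normal((bad.sum(), d))  # D = N(3 * ones, I_d)
    return X, mu

def topk_of_mean(X, k):
    # keep the k largest-magnitude coordinates of the empirical mean
    m = X.mean(axis=0)
    est = np.zeros_like(m)
    idx = np.argsort(-np.abs(m))[:k]
    est[idx] = m[idx]
    return est

rng = np.random.default_rng(0)
X, mu = sample_rsme(n=5000, d=2000, k=20, eps=0.05, rng=rng)
print(np.linalg.norm(topk_of_mean(X, 20) - mu))  # error driven by the corruptions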

3.2 Dense Stochastic Block Models


The stochastic block model (SBM) is the canonical model for community detection, having independently
emerged in the machine learning and statistics [HLL83], computer science [BCLS87, DF89, Bop87], statis-
tical physics [DKMZ11] and mathematics communities [BJR07]. It has been the subject of a long line of
research, which has recently been surveyed in [Abb17, Moo17]. In the k-block SBM, a vertex set of size
n is uniformly at random partitioned into k latent communities C1 , C2 , . . . , Ck each of size n/k and edges
are then included in the graph G independently such that intra-community edges appear with probability
p while inter-community edges appear with probability q < p. The exact recovery problem entails find-
ing C1 , C2 , . . . , Ck and the weak recovery problem, also known as community detection, entails outputting
nontrivial estimates Ĉ1, Ĉ2, . . . , Ĉk with |Ci ∩ Ĉi| ≥ (1 + Ω(1))n/k².
Community detection in the SBM is often considered in the sparse regime, where p = a/n and q = b/n.
In [DKMZ11], non-rigorous arguments from statistical physics were used to form the precise conjecture
that weak recovery begins to be possible in poly(n) time exactly at the Kesten-Stigum threshold SNR =
(a − b)2 /k(a + (k − 1)b) > 1. When k = 2, the algorithmic side of this conjecture was confirmed
with methods based on belief propagation [MNS18], spectral methods and non-backtracking walks [Mas14,
BLM15], and it was shown to be information-theoretically impossible to solve weak recovery below the
Kesten-Stigum threshold in [MNS15, DAM15]. The algorithmic side of this conjecture for general k was
subsequently resolved with approximate acyclic belief propagation in [AS15, AS16, AS18] and has also
been shown using low-degree polynomials, tensor decomposition and color coding [HS17]. A statistical-
computational gap is conjectured to already arise at k = 4 [AS18] and the information-theoretic limit for
community detection has been shown to occur for large k at SNR = Θ(log k/k), which is much lower
than the Kesten-Stigum threshold [BMNN16]. Rigorous evidence for this statistical-computational gap has
been much more elusive and has only been shown for low-degree polynomials [HS17] and variants of belief
propagation. Another related line of work has exactly characterized the thresholds for exact recovery in the
regime p, q = Θ(log n/n) when k = 2 [ABH15, HWX16a, HWX16b].
The k-block SBM for general edge densities p and q has also been studied extensively under the
names graph clustering and graph partitioning in the statistics and computer science communities. A

long line of work has developed algorithms recovering the latent communities in this regime, including
a wide range of spectral and convex programming techniques [Bop87, DF89, CK01, McS01, BS04, CO10,
RCY+11, CCT12, NN12, CSX12, Ame14, AGHK14, CSX14, CX16]. A comparison and survey of these
results can be found in [CSX14]. As discussed in [CX16], for growing k satisfying k = O(√n) and p and q
with p = Θ(q) and 1 − p = Θ(1 − q), the best known poly(n) time algorithms all only work above

    (p − q)²/q(1 − q) ≳ k²/n

which is an asymptotic extension of the Kesten-Stigum threshold to general p and q. In contrast, the sta-
tistically optimal rate of recovery is again roughly a factor of k lower at Ω̃(k/n). Furthermore, up to log n
factors, the Kesten-Stigum threshold is both when efficient exact recovery algorithms begin to work and
where the best efficient weak recovery algorithms are conjectured to fail [CX16].
In this work, we show computational lower bounds matching the Kesten-Stigum threshold up to a con-
stant factor in a mean-field analogue of recovering a first community C1 in the k-SBM, where p and q
are bounded away from zero and one. Consider a sample G from the k-SBM restricted to the union
of the other communities C2 , . . . , Ck . This subgraph has average edge density approximately given by
q̂ = (p − q) · (k − 1) · (n/k)2 · (n − n/k)−2 + q = (k − 1)−1 · p + (1 − (k − 1)−1 ) · q. Now consider the
task of recovering the community C1 in the graph G0 in which the subgraph on C2 , . . . , Ck is replaced by
the corresponding mean-field Erdős-Rényi graph G(n − n/k, q̂). Formally, let G0 be the graph formed by
first choosing C1 at random and sampling edges as follows:
• include edges within C1 with probability P11 = p;
• include edges between C1 and [n]\C1 with probability P12 = q; and
• include edges within [n]\C1 with probability P22 where P22 = (k − 1)⁻¹ · p + (1 − (k − 1)⁻¹) · q.
We refer to this model as the imbalanced SBM and let ISBM(n, k, P11 , P12 , P22 ) denote the problem of
testing between this model and Erdős-Rényi graphs of the form G(n, P0 ). As we will discuss in Section
6.3, lower bounds for this formulation also imply lower bounds for weakly and exactly recovering C1 . We
remark that under our notation for ISBM, the hidden community C1 has size n/k and k is the number of
communities in the analogous k-block SBM described above.
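As an illustration of the testing problem, here is a minimal sketch of sampling from the alternative hypothesis of ISBM(n, k, P11, P12, P22); the helper name is hypothetical, and the null hypothesis is simply a sample from G(n, P0).

```python
import numpy as np

def sample_isbm_h1(n, k, p, q, rng=np.random.default_rng(0)):
    """Sample the alternative hypothesis of ISBM(n, k, P11, P12, P22) with
    P11 = p, P12 = q and P22 = p/(k-1) + (1 - 1/(k-1)) * q."""
    P22 = p / (k - 1) + (1 - 1 / (k - 1)) * q
    C1 = rng.choice(n, size=n // k, replace=False)   # hidden community of size n/k
    in_C1 = np.zeros(n, dtype=bool)
    in_C1[C1] = True
    # Edge probability matrix: p within C1, q across the cut, P22 outside C1.
    cross = np.outer(in_C1, ~in_C1) | np.outer(~in_C1, in_C1)
    P = np.where(np.outer(in_C1, in_C1), p, np.where(cross, q, P22))
    A = (rng.random((n, n)) < P).astype(int)
    A = np.triu(A, 1)                                # one draw per unordered pair
    return A + A.T, C1

A, C1 = sample_isbm_h1(n=600, k=6, p=0.6, q=0.4)
```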
As we will discuss in Section 14.1, ISBM can also be viewed as a model of single community detection
with uniformly calibrated expected degrees. Note that the expected degree of a vertex in C1 is nP22 − p
and the expected degree of a vertex in [n]\C1 is (n − 1)P22, which differ by at most 1. Similar models with
two imbalanced communities and calibrated expected degrees have appeared previously in [NN14, VAC15,
PW17, CLM18]. As will be discussed in Section 3.4, the simpler planted dense subgraph model of single
community recovery has a detection threshold that differs from the Kesten-Stigum threshold, even though
the Kesten-Stigum threshold is conjectured to be the barrier for recovering the planted dense subgraph. This
is because non-uniformity in expected degrees gives rise to simple edge-counting tests that do not lead to
algorithms for recovering the planted subgraph. Our main result for ISBM is the following lower bound up
to the asymptotic Kesten-Stigum threshold.
Theorem 3.2 (Lower Bounds for ISBM). Suppose that (n, k) satisfy condition (T), that k is prime or k =
ω_n(1) and k = o(n^{1/3}), and suppose that q ∈ (0, 1) satisfies min{q, 1 − q} = Ω_n(1). If P22 = (k − 1)⁻¹ ·
p + (1 − (k − 1)⁻¹) · q, then the k-PC conjecture implies that there is a computational lower bound for
ISBM(n, k, p, q, P22) at all levels of signal below the Kesten-Stigum threshold of (p − q)²/q(1 − q) = õ(k²/n).

This directly provides evidence for the conjecture that (p − q)2 /q(1 − q) = Θ̃(k 2 /n) defines the
computational barrier for community recovery in general k-SBMs made in [CX16]. While the statistical-
computational gaps in PC and k-SBM are the two most prominent conjectured gaps in average-case problems

over graphs, they are very different from an algorithmic perspective and evidence for computational lower
bounds up to the Kesten-Stigum threshold has remained elusive. Our reduction yields a first step towards
understanding the relationship between these gaps.

3.3 Testing Hidden Partition Models


We also introduce two testing problems we refer to as the Gaussian and bipartite hidden partition models.
We give a reduction and algorithms that show these problems have a statistical-computational gap, and
we tightly characterize their computational barriers based on the k- PC conjecture. The main motivation
for introducing these problems is to demonstrate the versatility of our reduction technique dense Bernoulli
rotations in transforming hidden structure. A description of dense Bernoulli rotations and the construction
of a key design tensor used in our reduction can be found in Section 8.
The task in the bipartite hidden partition model problem is to test for the presence of a planted rK-
vertex subgraph, sampled from an r-block stochastic block model, in an n-vertex random bipartite graph.
The Gaussian hidden partition model problem is a corresponding Gaussian analogue. These are both multi-
community variants of the subgraph stochastic block model considered in [BBH18], which corresponds to
the setting in which r = 2. The multi-community nature of the planted subgraph yields a more intricate
hidden structure, and the additional free parameter r yields a more complicated computational barrier. The
work of [CX16] considered the related task of recovering the communities in the Gaussian and bipartite
hidden partition models. We remark that conjectured computational limits for this recovery task differ from
the detection limits we consider.
Formally, our hidden partition problems are defined as follows. Let C = (C1 , C2 , . . . , Cr ) and D =
(D1, D2, . . . , Dr) be chosen independently and uniformly at random from the set of all sequences of length
r consisting of disjoint K-subsets of [n]. Consider the random matrix M sampled by first sampling C and
D and then sampling

    M_ij ∼ N(γ, 1)              if i ∈ C_h and j ∈ D_h for some h ∈ [r]
    M_ij ∼ N(−γ/(r − 1), 1)     if i ∈ C_{h1} and j ∈ D_{h2} where h1 ≠ h2
    M_ij ∼ N(0, 1)              otherwise

independently for each 1 ≤ i, j ≤ n. The problem GHPM(n, r, K, γ) is to test between H0 : M ∼
N(0, 1)^⊗n×n and an alternative hypothesis H1 under which M is sampled as outlined above. The problem
BHPM(n, r, K, P0, γ) is a bipartite graph analogue of this problem with ambient edge density P0, edge
density P0 + γ within the communities in the subgraph and P0 − γ/(r − 1) on the rest of the subgraph.
As we will show in Section 14.2, an empirical variance test succeeds above the threshold γ²_comp =
Θ̃(n/rK²) and an exhaustive search succeeds above γ²_IT = Θ̃(1/K) in GHPM and BHPM where P0 is
bounded away from 0 and 1. Thus our main lower bounds for these two problems confirm that this empirical
variance test is approximately optimal among efficient algorithms and that both problems have a statistical-
computational gap assuming the k- PC conjecture.
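The precise test and its analysis appear in Section 14.2; the following is a minimal sketch of one natural instantiation of such an empirical variance test for GHPM, which compares the empirical second moment of the entries of M to its H0 value of 1 (the threshold constant is chosen only for illustration).

```python
import numpy as np

def ghpm_empirical_variance_test(M):
    """Reject H0 if the empirical second moment of the entries of M exceeds
    its null expectation of 1 by more than a few standard deviations.

    Under H0 each entry is N(0, 1); under H1 the planted means inflate the
    second moment by roughly gamma^2 * r * K^2 * (r/(r-1)) / n^2, so the test
    detects once gamma^2 >> n / (r * K^2), up to logarithmic factors."""
    n = M.shape[0]
    second_moment = np.mean(M ** 2)
    # Under H0, second_moment has mean 1 and standard deviation sqrt(2)/n.
    threshold = 1 + 5 * np.sqrt(2 * np.log(n)) / n
    return second_moment > threshold
```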
Theorem 3.3 (Lower Bounds for GHPM and BHPM). Suppose that r²K² = ω̃(n) and (⌈r²K²/n⌉, r) satisfies
condition (T), suppose r is prime or r = ω_n(1) and suppose that P0 ∈ (0, 1) satisfies min{P0, 1 −
P0} = Ω_n(1). Then the k-PC conjecture implies that there is a computational lower bound for
GHPM(n, r, K, γ) at all levels of signal γ² = õ(n/rK²). The same lower bound also holds for
BHPM(n, r, K, P0, γ) given the additional condition n = o(rK^{4/3}).

We also remark that the empirical variance and exhaustive search tests along with our lower bound
do not support the existence of a statistical-computational gap in the case when the subgraph is the entire
graph with n = rK, which is our main motivation for considering this subgraph variant. We remark that a

[Figure 2: two panels, "Community Detection" and "Community Recovery", plotting the regions that are IT impossible, PC-hard, poly-time and open in terms of the exponents α and β.]

Figure 2: Prior computational and statistical barriers in the detection and recovery of a single hidden community from
the PC conjecture [HWX15, BBH18, BBH19]. The axes are parameterized by α and β where SNR = (P1 − P0)²/(P0(1 − P0)) =
Θ̃(n^{−α}) and k = Θ̃(n^β). The red region is conjectured to be hard but no PC reductions showing this are known.

number of the technical conditions in the theorem such as condition ( T ) and n = o(rK 4/3 ) are trivial in the
parameter regime where the number of communities is not very large with r = no(1) and when the total size
of the hidden communities is large with rK = Θ̃(nc ) where c > 3/4. In this regime, these problems have a
nontrivial statistical-computational gap that our result tightly characterizes.

3.4 Semirandom Planted Dense Subgraph and the Recovery Conjecture


In the planted dense subgraph model of single community recovery, the observation is a sample from
G(n, k, P1 , P0 ) which is formed by planting a random subgraph on k vertices from G(k, P1 ) inside a copy
of G(n, P0 ), where P1 > P0 are allowed to vary with n and satisfy that P1 = O(P0 ). Detection and
recovery of the hidden community in this model have been studied extensively [ACV14, BI13, VAC15,
HWX15, CX16, HWX16c, Mon15, CC18] and this model has emerged as a canonical example of a prob-
lem with a detection-recovery computational gap. While it is possible to efficiently detect the presence of

a hidden subgraph of size k = Ω̃(√n) if (P1 − P0)²/P0(1 − P0) = Ω̃(n²/k⁴), the best known poly-
nomial time algorithms to recover the subgraph require a higher signal at the Kesten-Stigum threshold of
(P1 − P0 )2 /P0 (1 − P0 ) = Ω̃(n/k 2 ).
In each of [HWX15, BBH18] and [BBH19], it has been conjectured that the recovery problem is hard
below this threshold of Θ̃(n/k 2 ). This PDS Recovery Conjecture was even used in [BBH18] as a hardness
assumption to show detection-recovery gaps in other problems including biased sparse PCA and Gaus-
sian biclustering. A line of work has tightly established the conjectured detection threshold through reduc-
tions from the PC conjecture [HWX15, BBH18, BBH19], while the recovery threshold has remained elusive.
Planted clique maps naturally to the detection threshold in this model, so it seems unlikely that the PC con-
jecture could also yield lower bounds at the tighter recovery threshold, given that recovery and detection are
known to be equivalent for PC [AAK+ 07]. These prior lower bounds and the conjectured detection-recovery
gap in PDS are depicted in Figure 2.
We show that the k- PC conjecture implies the PDS Recovery Conjecture for semirandom community re-
covery in the regime where P0 = Θ(1). Semirandom adversaries provide an alternate notion of robustness
against constrained modifications that heuristically appear to increase the signal strength [BS95]. Algo-
rithms and lower bounds in semirandom problems have been studied for a number of problems, including
the stochastic block model [FK01, MPW16], planted clique [FK00], unique games [KMM11], correlation

clustering [MS10, MMV15], graph partitioning [MMV12], 3-coloring [DF16] and clustering mixtures of
Gaussians [VA18]. Formally we consider the problem SEMI - CR(n, k, P1 , P0 ) where a semirandom adver-
sary is allowed to remove edges outside of the planted subgraph from a graph sampled from G(n, k, P1 , P0 ).
The task is to test between this model and an Erdős-Rényi graph G(n, P0 ) similarly perturbed by a semi-
random adversary. As we will discuss in Section 6.3, lower bounds for this formulation extend to approx-
imately recovering the hidden community under a semirandom adversary. In Section 14.3, we prove the
following theorem – that the computational barrier in the detection problem shifts to the recovery threshold
in SEMI - CR.
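For concreteness, the following is a minimal sketch of the H1 generative process for SEMI-CR together with one toy example of an allowed semirandom perturbation, a random deletion of edges outside the planted subgraph; the actual adversary may remove any subset of such edges, including adaptively and deterministically.

```python
import numpy as np

def sample_semi_cr_h1(n, k, P1, P0, delete_prob=0.3, rng=np.random.default_rng(0)):
    """Sample G(n, k, P1, P0) and apply one allowed semirandom perturbation:
    each edge with at least one endpoint outside the planted subgraph is
    removed independently with probability delete_prob."""
    S = rng.choice(n, size=k, replace=False)
    planted = np.zeros(n, dtype=bool)
    planted[S] = True
    P = np.where(np.outer(planted, planted), P1, P0)       # planted dense subgraph
    A = np.triu((rng.random((n, n)) < P).astype(int), 1)
    outside = ~np.outer(planted, planted)                  # pairs not inside S x S
    A[outside & (rng.random((n, n)) < delete_prob)] = 0    # semirandom edge removals
    return A + A.T, S
```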

Theorem 3.4 (Lower Bounds for SEMI-CR). If k and n are polynomial in each other with k = Ω(√n)
and 0 < P0 < P1 ≤ 1 where min{P0, 1 − P0} = Ω(1), then the k-PC conjecture implies that there is a
computational lower bound for SEMI-CR(n, k, P1, P0) at (P1 − P0)²/P0(1 − P0) = õ(n/k²).
A related reference is the reduction in [CLR15], which proves a detection-recovery gap in the context
of sub-Gaussian submatrix localization based on the hardness of finding a planted k-clique in a random
n/2-regular graph. The relationship between our lower bound and that of [CLR15] is discussed in more
detail in Section 14.3. From an algorithmic perspective, the convexified maximum likelihood algorithm
from [CX16] complements our lower bound – a simple monotonicity argument shows that it continues to
solve the community recovery problem above the Kesten-Stigum threshold under a semirandom adversary.

3.5 Negatively Correlated Sparse Principal Component Analysis


In sparse principal component analysis (PCA), the observations X1 , X2 , . . . , Xn are n independent samples
from N (0, Σ) where the eigenvector v corresponding to the largest eigenvalue of Σ is k-sparse, and the
task is to estimate v in `2 norm or find its support. Sparse PCA has many applications ranging from on-
line visual tracking [WLY13] and image compression [Maj09] to gene expression analysis [ZHT06, CK09,
PTB09, CH10]. Showing lower bounds for sparse PCA can be reduced to analyzing detection in the spiked
covariance model [JL04], which has hypotheses

H0 : X ∼ N (0, Id )⊗n and H1 : X ∼ N (0, Id + θvv > )⊗n

Here, H1 is the composite hypothesis where v ∈ R^d is unknown and allowed to vary over all k-sparse unit
vectors. The information-theoretically optimal rate of detection is at the level of signal θ = Θ(√(k log d/n))
[BR13b, CMW15, WBS16]. However, when k = o(√d), the best known polynomial time algorithms
for sparse PCA require that θ = Ω(√(k²/n)). Since the seminal paper of [BR13a] initiated the study of
statistical-computational gaps through the PC conjecture, this k-to-k 2 gap for sparse PCA has been shown to
follow from the PC conjecture in a sequence of papers [BR13b, BR13a, WBS16, GMZ17, BBH18, BB19b].
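As a reference point for what follows, here is a minimal sketch of sampling from the spiked covariance alternative H1 and from its negatively spiked counterpart used in NEG-SPCA below; the signal-plus-noise form used for the positive spike is the representation discussed in the next paragraphs, and the function name is illustrative.

```python
import numpy as np

def sample_spiked(n, d, k, theta, sign=+1, rng=np.random.default_rng(0)):
    """Sample n rows from N(0, I_d + sign * theta * v v^T) for a k-sparse unit v.

    For sign = +1 this uses the signal-plus-noise form X_i = sqrt(theta) g_i v + Z_i;
    for sign = -1 (negative sparse PCA, requires theta < 1) no such form exists,
    so we sample directly from the covariance matrix."""
    v = np.zeros(d)
    v[rng.choice(d, size=k, replace=False)] = 1.0 / np.sqrt(k)
    if sign == +1:
        g = rng.standard_normal((n, 1))
        return np.sqrt(theta) * g * v + rng.standard_normal((n, d)), v
    cov = np.eye(d) - theta * np.outer(v, v)
    return rng.multivariate_normal(np.zeros(d), cov, size=n), v
```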
In negatively correlated sparse PCA, the eigenvector v of interest instead corresponds to the small-
est eigenvalue of Σ. Negative sparse PCA can similarly be formulated as a hypothesis testing problem
NEG - SPCA (n, k, d, θ), where the alternative hypothesis is instead given by H1 : X ∼ N (0, Id − θvv > )⊗n .
Similar algorithms as in ordinary sparse PCA continue to work in the negative setting – the information-
theoretic limit of the problem remains at θ = Θ(√(k log d/n)) and the best known efficient algorithms still
require θ = Ω(√(k²/n)). However, negative sparse PCA is stochastically a very differently structured prob-
lem than ordinary sparse PCA. A sample from the ordinary spiked covariance model can be expressed as

Xi = √θ · gv + N(0, Id)

where g ∼ N (0, 1) is independent of the N (0, Id ) term. This signal plus noise representation is a common
feature in many high-dimensional statistical models and is crucially used in the reductions showing hardness
for sparse PCA in [BR13b, BR13a, WBS16, GMZ17, BBH18, BB19b]. Negative sparse PCA does not admit

a representation of this form, making it an atypical planted problem and different from ordinary sparse
PCA, despite the deceiving similarity between their optimal algorithms. The lack of this representation
makes reducing to Negative sparse PCA technically challenging. Negatively spiked PCA was also recently
related to the hardness of finding approximate ground states in the Sherrington-Kirkpatrick model [BKW19].
However, ordinary PCA does not seem to share this connection. In Section 9, we give a reduction obtaining
the following computational lower bound for NEG - SPCA from the BPC conjecture.

Theorem 3.5 (Lower Bounds for NEG-SPCA). If k, d and n are polynomial in each other, k = o(√d) and
k = o(n^{1/6}), then the BPC conjecture implies a computational lower bound for NEG-SPCA(n, k, d, θ) at all
levels of signal θ = õ(√(k²/n)).

We deduce this theorem and discuss its conditions in detail in Section 13.2. A key step in our reduction
to NEG - SPCA involves randomly rotating the positive semidefinite square root of the inverse of an empirical
covariance matrix. In analyzing this step, we prove a novel convergence result in random matrix theory,
which may be of independent interest. Specifically, we characterize when a Wishart matrix and its inverse
converge in KL divergence. This is where the parameter constraint k = o(n1/6 ) in the theorem above
arises. We believe that this is an artefact of our techniques and extending the theorem to hold without
this condition is an interesting open problem. A similar condition arose in the strong lower bounds of
[BB19b]. We remark that conditions of this form do not affect the tightness of our lower bounds, but
rather only impose a constraint on the level of sparsity k. More precisely, for each fixed level of sparsity
k = Θ̃(n^α), there is a conjectured statistical-computational gap in θ between the information-theoretic barrier
of θ = Θ(√(k log d/n)) and the computational barrier of θ = õ(√(k²/n)). Our reduction tightly establishes
this gap for all α ∈ (0, 1/6]. Our main motivation for considering NEG - SPCA is that it seems to have
a fundamental connection to the structure of supervised problems where ordinary sparse PCA does not.
In particular, our reduction to NEG - SPCA is a crucial subroutine in reducing to mixtures of sparse linear
regressions and robust sparse linear regression. This is discussed further in Sections 4, 9 and 10.

3.6 Unsigned and Mixtures of Sparse Linear Regressions


In learning mixtures of sparse linear regressions (SLR), the task is to learn L sparse linear functions cap-
turing the relationship between features and response variables in heterogeneous samples from L different
sparse regression problems. Formally, the observations (X1 , y1 ), (X2 , y2 ), . . . , (Xn , yn ) are n independent
sample-label pairs given by yi = hβ, Xi i + ηi where Xi ∼ N (0, Id ), ηi ∼ N (0, 1) and β is chosen from a
mixture distribution ν over a finite set of k-sparse vectors {β1, β2, . . . , βL} of bounded ℓ2 norm. The task is to
estimate the components βj that are sufficiently likely under ν in `2 norm i.e. to within an `2 distance of τ .
Mixtures of linear regressions, also known as the hierarchical mixtures of experts model in the machine
learning community [JJ94], was first introduced in [QR78] and has been studied extensively in the past few
decades [DV89, WD95, MP04, ZZ04, FS10]. Recent work on mixtures of linear regressions has focussed on
efficient algorithms with finite-sample guarantees [CL13, CYC14, YCS14, BWY+ 17, CYC17, LL18]. The
high-dimensional setting of mixtures of SLRs was first considered in [SBVDG10], which proved an oracle
inequality for an `1 -regularization approach, and variants of the EM algorithm for mixtures of SLRs were
analyzed in [WGNL14, YC15]. Recent work has also studied a different setting for mixtures of SLRs where
the covariates Xi can be designed by the learner [YPCR18, KMMP19].
We show that a statistical-computational gap emerges for mixtures of SLRs even in the simplest case
where there are L = 2 components, the mixture distribution ν is known to sample each component with
probability 1/2 and the task is to estimate even just one of the components {β1 , β2 } to within `2 norm τ . We
refer to this simplest setup for learning mixtures of SLRs as MSLR(n, k, d, τ ). The following computational
lower bound is deduced in Section 13.3 and is a consequence of the reduction in Section 10.
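The following is a minimal sketch of sampling from this simplest MSLR setup with L = 2 equally likely components in the symmetric case β1 = −β2, which is the testing formulation our reduction targets (see Section 6.3); the function name is illustrative.

```python
import numpy as np

def sample_mslr(n, d, k, gamma, rng=np.random.default_rng(0)):
    """Sample (X_i, y_i) pairs from a symmetric two-component mixture of SLRs:
    y_i = <R_i * beta, X_i> + eta_i with R_i uniform on {-1, +1}, so the two
    components are beta_1 = beta and beta_2 = -beta."""
    beta = np.zeros(d)
    beta[rng.choice(d, size=k, replace=False)] = gamma / np.sqrt(k)  # k-sparse, ||beta||_2 = gamma
    X = rng.standard_normal((n, d))
    R = rng.choice([-1.0, 1.0], size=n)
    y = R * (X @ beta) + rng.standard_normal(n)
    return X, y, beta

X, y, beta = sample_mslr(n=5000, d=1000, k=10, gamma=1.0)
```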


Theorem 3.6 (Lower Bounds for MSLR). If k, d and n are polynomial in each other, k = o(√d) and k =
o(n^{1/6}), then the k-BPC conjecture implies that there is a computational lower bound for MSLR(n, k, d, τ)
at all sample complexities n = õ(k²/τ⁴).

As we will discuss in Section 6.3, we will prove this theorem by reducing to the problem of testing
between the mixtures of SLRs model when β1 = −β2 and a null hypothesis under which y and X are
independent. A closely related work [FLWY18] studies a nearly identical testing problem in the statistical
query model. They tightly characterize the information-theoretic limit of this problem, showing that it occurs
at the sample complexity n = Θ̃(k log d/τ 4 ). Therefore our reduction establishes a k-to-k 2 statistical-
computational gap in this model of learning mixtures of SLRs. In [FLWY18], it is also shown that efficient
algorithms in the statistical query model suffer from this same k-to-k 2 gap.
Our reduction to the hypothesis testing formulation of MSLR above is easily seen to imply that the same
computational lower bound holds for an unsigned variant USLR(n, k, d, τ ) of SLR, where the n observations
(X1 , y1 ), (X2 , y2 ), . . . , (Xn , yn ) now of the form yi = |hβ, Xi i + ηi | for a fixed unknown β. Note that
by the symmetry of N (0, 1), yi is equidistributed to ||hβ, Xi i| + ηi | and thus is a noisy observation of
|hβ, Xi i|. In general, noisy observations of the phaseless modulus |hβ, Xi i| from some conditional link
distribution P(· | |hβ, Xi i|) yields a general instance of phase retrieval [MM18a, CMW20]. As observed
in [FLWY18], the problem USLR is close to the canonical formulation of sparse phase retrieval (SPR)
where P(· | |hβ, Xi i|) is N (|hβ, Xi i|2 , σ 2 ), which has been studied extensively and has a conjectured k-
to-k 2 statistical-computational gap [LV13, SR14, CLS15, CLM16, WZG+ 17, HLV18, BKM+ 19, CMW20].
Our lower bounds provide partial evidence for this conjecture and it is an interesting open problem to give a
reduction to the canonical formulation of SPR and other sparse GLMs through average-case reductions.
The reduction to MSLR showing Theorem 3.6 in Section 10 is our capstone reduction. It showcases
a wide range of our techniques including dense Bernoulli rotations, constructions of combinatorial design
matrices from Ftr , our reduction to NEG - SPCA and its connection to random matrix theory, and an additional
technique of combining instances of different unsupervised problems into a supervised problem. We give
an overview of these techniques in Section 4. Furthermore, MSLR is a very differently structured problem
from any of our variants of PC and it is surprising that the tight statistical-computational gap for MSLR can
be derived from their hardness. We remark that our lower bounds for MSLR inherit the technical condition
that k = o(n1/6 ) from our reduction to NEG - SPCA. As before, this does not affect the fact that we show
tight hardness and it is an interesting open problem to remove this condition.

3.7 Robust Sparse Linear Regression


In ordinary SLR, the observations (X1 , y1 ), (X2 , y2 ), . . . , (Xn , yn ) are independent sample-label pairs given
by yi = hβ, Xi i + ηi where Xi ∼ N (0, Σ), ηi ∼ N (0, 1) and β is an unknown k-sparse vector with
bounded `2 norm. The task is to estimate β to within `2 norm τ . When Σ is well-conditioned, SLR is a
gapless problem with the computationally efficient LASSO attaining the information-theoretically optimal
sample complexity of n = Θ(k log d/τ 2 ) [Tib96, BRT09, RWY10]. When Σ is not well-conditioned, SLR
has a statistical-computational gap based on its restricted eigenvalue constant [ZWJ14]. As with robust
sparse mean estimation, the robust SLR problem RSLR(n, k, d, τ, ε) is obtained when a computationally-
unbounded adversary corrupts an arbitrary ε-fraction of the observed sample-label pairs. In this work, we
consider the simplest case of Σ = Id where SLR is gapless but, as we discuss next, robustness seems to
induce a statistical-computational gap.
Robust regression is a well-studied classical problem in statistics [RL05]. Efficient algorithms remained
elusive for decades, but recent breakthroughs in sum of squares algorithms [KKM18, KKK19, RY20], fil-
tering approaches [DKS19] and robust gradient descent [CSX17, PSBR18, DKK+ 19] have led to the first
efficient algorithms with provable guarantees. A recent line of work has also studied efficient algorithms and

barriers in the high-dimensional setting of robust SLR [CCM13, BDLS17, LSLC18, LLC19]. Even in the
simplest case of Σ = Id where the covariates Xi have independent entries, the best known polynomial time
algorithms suggest robust SLR has a k-to-k 2 statistical-computational gap. As shown in [Gao20], similar
to RSME, robust SLR is only information-theoretically possible if τ = Ω(ε). In [BDLS17, LSLC18], it is
shown that polynomial-time ellipsoid-based algorithms solve robust SLR with n = Θ̃(k² log d/ε²) samples
when τ = Θ̃(ε). Furthermore, [LSLC18] shows that an RSME oracle can be used to solve robust SLR with
only a Θ̃(1) factor loss in τ and the required number of samples n. As noted in [Li17], n = Ω(k log d/ε²)
samples suffice to solve RSME inefficiently when τ = Θ(ε). Combining these observations yields an in-
efficient algorithm for robust SLR with sample complexity n = Θ̃(k log d/ε²) samples when τ = Θ̃(ε),
confirming that the best known efficient algorithms suggest a k-to-k² statistical-computational gap. In
[CCM13, LLC19], efficient algorithms are shown to succeed in an alternative regime where n = Θ̃(k log d),
ε = Õ(1/√k) and τ = Õ(√k).
All of these algorithms suggest that the correct computational sample complexity for robust SLR is
n = Ω̃(k²ε²/τ⁴). In Section 13.3, we deduce the following tight computational lower bound for RSLR
providing evidence for this conjecture.

Theorem 3.7 (Lower Bounds for RSLR). If k, d and n are polynomial in each other, k = o(n^{1/6}), k =
o(√d) and ε < 1/2 is such that ε = Ω̃(n^{−1/2}), then the k-BPC conjecture implies that there is a computa-
tional lower bound for RSLR(n, k, d, τ, ε) at all sample complexities n = õ(k²ε²/τ⁴).

We present the reductions to MSLR and RSLR together as a single unified reduction k- PDS - TO - MSLR
in Section 10. As is discussed in Section 13.3, MSLR and RSLR are obtained by setting r = ε⁻¹ =
2 and ε < 1/2, respectively. The theorem above follows from a slightly modified version of this re-
duction, k-PDS-TO-MSLR_R, that removes the technical condition (T) that otherwise arises in applying
k-PDS-TO-MSLR with r = n^{Ω(1)}. This turns out to be more important here than in the context of RSME be-
cause, as in the reduction to MSLR, this reduction to RSLR inherits the technical condition that k = o(n^{1/6})
from our reduction to NEG-SPCA. This condition implicitly imposes a restriction on ε to satisfy
ε = Õ(n^{−1/3}), since τ = Ω(ε) must hold for the problem to not be information-theoretically impos-
sible. Thus our regime of interest for RSLR is a regime where the technical condition (T) is nontrivial.
As in the case of MSLR and NEG - SPCA, we emphasize that the condition k = o(n1/6 ) does not affect the
tightness of our lower bounds, merely restricting their regime of application. In particular, the theorem above
yields a tight nontrivial statistical-computational gap in the entire parameter regime when k = o(n^{1/6}),
τ = Ω(ε) and ε = Θ̃(n^{−c}) where c is any constant in the interval [1/3, 1/2]. We remark that the condition
k = o(n1/6 ) seems to be an artefact of our techniques rather than necessary.
In the context of RSLR, we view our main contribution as a set of reduction techniques relating PCρ
to the very differently structured problem RSLR, rather than the resulting computation lower bound itself.
A byproduct of our reduction is the explicit construction of an adversary modifying an ε-fraction of the
samples in robust SLR that produces the k-to-k 2 statistical-computational gap in the theorem above. This
adversary turns out to be surprisingly nontrivial on its own, but is a direct consequence of the structure of
the reduction. This is discussed in more detail in Sections 10.2 and 13.3.

3.8 Tensor Principal Component Analysis


In Tensor PCA, the observation is a single order-s tensor T with dimensions n^⊗s = n × n × · · · × n given
by T ∼ θv^⊗s + N(0, 1)^{⊗n^⊗s}, where v has a Rademacher prior and is distributed uniformly over {−1, 1}^n
[RM14]. The task is to recover v within nontrivial ℓ2 error o(√n) and is only information-theoretically
possible if θ = ω̃(n^{(1−s)/2}) [RM14, LML+17, CHL18, JLM18, Che19, PWB+20], in which case v can be
recovered through exhaustive search. The best known polynomial-time algorithms all require the higher
signal strength θ = Ω̃(n−s/4 ), at which point v can be recovered through spectral algorithms [RM14],

the sum of squares hierarchy [HSS15, HSSS16] and spectral algorithms based on the Kikuchi hierarchy
[WEAM19]. Lower bounds up to this conjectured computational barrier have been shown in the sum of
squares hierarchy [HSS15, HKP+ 17] and for low-degree polynomials [KWB19]. A number of natural “lo-
cal” algorithms have also been shown to fail given much stronger levels of signal up to θ = õ(n−1/2 ),
including approximate message passing, the tensor power method, Langevin dynamics and gradient descent
[RM14, AGHK14, BGJ18].
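For reference, here is a minimal sketch of sampling the order-s observation with an asymmetric noise tensor; as remarked later in this subsection, the symmetric and asymmetric noise models are equivalent up to a constant factor loss in θ, and the function name is illustrative.

```python
import numpy as np

def sample_tpca(n, s, theta, rng=np.random.default_rng(0)):
    """Sample T = theta * v^{tensor s} + G where v is uniform on {-1, 1}^n and
    G has i.i.d. N(0, 1) entries (asymmetric noise model)."""
    v = rng.choice([-1.0, 1.0], size=n)
    spike = v
    for _ in range(s - 1):
        spike = np.multiply.outer(spike, v)       # rank-1 tensor v^{tensor s}
    return theta * spike + rng.standard_normal((n,) * s), v

T, v = sample_tpca(n=30, s=3, theta=30 ** (-3 / 4))
```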
We give a reduction showing that the PCρ conjecture implies an optimal computational lower bound
at θ = Ω̃(n^{−s/4}) for tensor PCA. We show this lower bound against efficient algorithms with a low false
positive probability of error in the hypothesis testing formulation of tensor PCA, where T ∼ N(0, 1)^{⊗n^⊗s}
under H0 and T is sampled from the tensor PCA distribution described above under H1. More precisely, we
prove the following theorem in Sections 11 and 15.

Theorem 3.8 (Lower Bounds for TPCA). Let n be a parameter and let s ≥ 3 be a constant. Then the k-HPC_s
conjecture implies a computational lower bound for TPCA_s(n, θ) when θ = õ(n^{−s/4}) against poly(n) time
algorithms A solving TPCA_s(n, θ) with a low false positive probability of P_{H0}[A(T) = H1] = O(n^{−s}).

Lemma 15.4 in Section 15 shows that any poly(n) time algorithm solving the recovery formulation of
tensor PCA yields such an algorithm A, and thus this theorem implies our desired computational lower
bound. This low false positive probability of error condition on A arises from the fact that our reduction to
TPCA is a multi-query average-case reduction, requiring multiple calls to a tensor PCA blackbox to solve
k- HPCs . This feature is a departure from the rest of our reductions and the other average-case reductions
to statistical problems in the literature, all of which are reductions in total variation, as will be described
in Section 6.2, and thus only require a single query. This feature is a requirement of our technique for
completing hypergraphs that will be described further in Sections 4.6 and 11.
We note that most formulations of tensor PCA in the literature also assume that the noise tensor of
standard Gaussians is symmetric [RM14, WEAM19]. However, given that the planted rank-1 component
v ⊗s is symmetric as it is in our formulation, the symmetric and asymmetric noise models have a simple
equivalence up to a constant factor loss in θ. Averaging the entries of the asymmetric model over all permu-
tations of its s coordinates shows one direction of this equivalence, and the other is achieved by reversing
this averaging procedure through Gaussian cloning as in Section 10 of [BBH18]. A closely related work
is that of [ZX17], which gives a reduction from HPC3 to the problem of detecting a planted rank-1 compo-
nent in a 3-tensor of Gaussian noise. Aside from being obtained through different techniques, their result differs
from ours in two ways: (1) the rank-1 components they considered were sparse, rather than sampled from a
Rademacher prior; and (2) their reduction necessarily produces asymmetric rank-1 components. Although
the limits of tensor PCA when s ≥ 3 with sparse and Rademacher priors are similar, they can be very
different in other problems. For example, in the matrix case when s = 2, a sparse prior yields a problem
with a statistical-computational gap while a Rademacher prior does not. We also remark that ensuring the
symmetry of the planted rank-1 component is a technically difficult step and part of the motivation for our
completing hypergraphs technique in Section 11.

3.9 Universality for Learning Sparse Mixtures


When ε = 1/2, our reduction to robust sparse mean estimation also implicitly shows tight computa-
tional lower bounds at n = õ(k 2 /τ 4 ) for learning sparse Gaussian mixtures. In this problem the task
is to estimate two vectors µ1 , µ2 up to `2 error τ , where the µi have bounded `2 norms and a k-sparse
difference µ1 − µ2 , given samples from an even mixture of N (µ1 , Id ) and N (µ2 , Id ). In general, learn-
ing in Gaussian mixture models with sparsity has been studied extensively over the past two decades
[RD06, PS07, MCMM09, MM11, ASW13, ASW15, MWFSG16, VAC17, FLWY18]. Recent work has es-
tablished finite-sample guarantees for efficient and inefficient algorithms and proven information-theoretic

lower bounds for the two-component case [ASW13, VAC17, FLWY18]. These works conjectured that this
problem has the k-to-k 2 statistical-computational gap shown by our reduction. In [FLWY18], a tight com-
putational lower bound matching ours was established in the statistical query model.
So far, despite having a variety of different hidden structures, the problems we have considered have
all had either Gaussian or Bernoulli noise distributions. As we will describe in Section 4, our techniques
also crucially use a number of properties of the Gaussian distribution. This naturally raises the question:
do our techniques have implications beyond simple noise distributions? Our final reduction answers this
affirmatively, showing that our lower bound for learning sparse Gaussian mixtures implies computational
lower bounds for a wide universality class of noise distributions. This lower bound includes the optimal gap
in learning sparse Gaussian mixtures and the optimal gaps in [BR13b, BR13a, WBS16, GMZ17, BBH18]
for sparse PCA as special cases. This reduction requires introducing a new type of rejection kernel, that we
refer to as symmetric 3-ary rejection kernels, and is described in Sections 4.7 and 7.3.
In Section 16, we show computational lower bounds for the generalized learning sparse mixtures prob-
lem GLSM. In GLSM(n, k, d, U) where U = (D, Q, {Pν }ν∈R ), the elements of the family {Pν }ν∈R and Q
are distributions on a measurable space, such that the pairs (Pν , Q) all satisfy mild conditions permitting
efficient computation outlined in Section 7.3, and D is a mixture distribution on R. The observations in
GLSM are n independent samples X1 , X2 , . . . , Xn formed as follows:

• for each sample Xi , draw some latent variable νi ∼ D and

• sample (Xi )j ∼ Pνi if j ∈ S and (Xi )j ∼ Q otherwise, independently

where S is some unknown subset containing k of the d coordinates. The task is to recover S or distinguish
from an H0 in which all of the data is drawn i.i.d. from Q. Given a collection of distributions U, we define
U to be in our universality class UC(N ) with level of signal τU if it satisfies the following conditions.

Definition 3.9 (Universality Class and Level of Signal). Given a parameter N , define the collection of
distributions U = (D, Q, {Pν }ν∈R ) implicitly parameterized by N to be in the universality class UC(N ) if

• the pairs (Pν , Q) are all computable pairs, as in Definition 7.6, for all ν ∈ R;

• D is a symmetric distribution about zero and Pν∼D [ν ∈ [−1, 1]] = 1 − o(N −1 ); and

• there is a level of signal τ_U ∈ R such that for all ν ∈ [−1, 1] and any fixed constant K > 0, it holds
that

    | dP_ν/dQ(x) − dP_{−ν}/dQ(x) | = O_N(τ_U)   and   | dP_ν/dQ(x) + dP_{−ν}/dQ(x) − 2 | = O_N(τ_U²)

with probability at least 1 − O(N^{−K}) over each of P_ν, P_{−ν} and Q.




Our main result establishes a computational lower bound for GLSM instances with U ∈ UC(n) in terms
of the level of signal τU . As mentioned above, this theorem implies optimal lower bounds for learning sparse
mixtures of Gaussians, sparse PCA and many more natural problem formulations described in Section 16.2.

Theorem 3.10 (Computational Lower Bounds for GLSM). Let n, k and d be polynomial in each other and
such that k = o(√d). Suppose that the collection of distributions U = (D, Q, {Pν}ν∈R) is in UC(n). Then
the k-BPC conjecture implies a computational lower bound for GLSM(n, k, d, U) at all sample complexities
n = õ(τ_U⁻⁴).
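As a concrete illustration of the definitions above, the following sketch samples GLSM data for one simple member of the universality class: Q = N(0, 1), P_ν = N(signal · ν, 1) and D uniform on {−1, 1}, for which the level of signal τ_U is of order the per-coordinate parameter signal up to logarithmic factors (sparse Gaussian mixtures of this form are among the special cases treated in Section 16.2). The helper name is illustrative.

```python
import numpy as np

def sample_glsm_gaussian(n, d, k, signal, rng=np.random.default_rng(0)):
    """Sample GLSM(n, k, d, U) for U = (D, Q, {P_nu}) with Q = N(0, 1),
    P_nu = N(signal * nu, 1) and D uniform on {-1, +1}. Coordinates outside
    the hidden support S are pure noise from Q."""
    S = rng.choice(d, size=k, replace=False)         # hidden k-subset of coordinates
    nu = rng.choice([-1.0, 1.0], size=(n, 1))        # latent variable nu_i ~ D per sample
    X = rng.standard_normal((n, d))                  # every coordinate starts as Q = N(0, 1)
    X[:, S] += signal * nu                           # planted coordinates follow P_{nu_i}
    return X, S

X, S = sample_glsm_gaussian(n=4000, d=800, k=20, signal=0.2)
```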


4 Technical Overview
We now outline our main technical contributions and the central ideas behind our reductions. These tech-
niques will be formally introduced in Part II and applied in our problem-specific reductions to deduce our
main theorems stated in the previous section in Part III.

4.1 Rejection Kernels


Rejection kernels are a reduction primitive introduced in [BBH18, BBH19] for algorithmic changes of mea-
sure. Related reduction primitives for changes of measure to Gaussians and binomial random variables
appeared earlier in [MW15b, HWX15]. Given two input Bernoulli probabilities 0 < q < p ≤ 1, a rejection
kernel simultaneously maps Bern(p) and Bern(q) approximately in total variation to samples from two ar-
bitrary distributions P and Q. Note that in this setup, the rejection kernel primitive is oblivious to whether
the true distribution of its input is Bern(p) or Bern(q). The main idea behind rejection kernels is that, under
suitable conditions on P and Q, this can be achieved through a rejection sampling scheme that samples
x ∼ Q and rejects with a probability that depends on x and on whether the input was 0 or 1. Rejection
kernels are discussed in more depth in Section 7. In this work, we will need the following two instantiations
of the framework developed in [BBH18, BBH19]:

• Gaussian Rejection Kernels: Rejection kernels mapping Bern(p) and Bern(q) to within O(R_RK⁻¹) total
variation of N(µ, 1) and N(0, 1), where µ = Θ(1/√(log R_RK)) and p, q are fixed constants.

• Bernoulli Cloning: A rejection kernel mapping Bern(p) and Bern(q) exactly to Bern(P)^⊗t and
Bern(Q)^⊗t where

    (1 − p)/(1 − q) ≤ ((1 − P)/(1 − Q))^t   and   (P/Q)^t ≤ p/q

By performing computational changes of measure, these primitives are crucial in mapping to desired dis-
tributional aesthetics. However, they also play an important role in transforming hidden structure. Gaus-
sian rejection kernels grant access to an arsenal of measure-preserving transformations of high-dimensional
Gaussian vectors for mapping between different hidden structures while preserving independence in the
noise distribution. Bernoulli cloning is crucial in removing the symmetry in adjacency matrices of PC in-
stances and adjacency tensors of HPC instances, as in the T O -S UBMATRIX procedure in [BBH19]. We
introduce a k-partite variant of this procedure that maps the adjacency matrix of k- PDS to a matrix of in-
dependent Bernoulli random variables while respecting the constraint that there is one planted entry per
block of the k-partition. This procedure is discussed in more detail in Section 4.6 and will serve as a crucial
preprocessing step for dense Bernoulli rotations, which involves taking linear combinations of functions of
entries of this matrix that crucially must be independent.

4.2 Dense Bernoulli Rotations


This technique is introduced in Section 8 and is one of our main primitives for transforming hidden structure
that will be applied repeatedly throughout our reductions. Let PB(n, i, p, q) denote the planted bit distribu-
tion over V ∈ {0, 1}n with independent entries satisfying that Vj ∼ Bern(q) unless j = i, in which case
Vi ∼ Bern(p). Given an input vector V ∈ {0, 1}n , the goal of dense Bernoulli rotations is to output a vector
V 0 ∈ Rm such that, for each i ∈ [n], V 0 is close in total variation to N (c · Ai , Im ) if V ∼ PB(n, i, p, q).
Here, A1 , A2 , . . . , An ∈ Rm are a given sequence of target mean vectors, p and q are fixed constants and c

is a scaling factor with c = Θ̃(1). The reduction must satisfy these approximate Markov transition condi-
tions oblivious to the planted bit i and also preserve independent noise, by mapping Bern(q)⊗n to N (0, Im )
approximately in total variation.
Let A ∈ Rm×n denote the matrix with columns A1 , A2 , . . . , An . If the rows of A are orthogonal unit
vectors, then the goal outlined above can be achieved using the isotropy of the distribution N (0, In ). More
precisely, consider the reduction that form V1 ∈ Rn by applying Gaussian rejection kernels entrywise to V
and then outputs AV1 . If V ∼ PB(n, i, p, q), then the rejection kernels ensure that V1 is close in total variation
to N (µ · 1i , In ) and thus V 0 = AV1 is close to N (µ · Ai , Im ). However, if the rows of A are not orthogonal,
then the entries of the output are potentially very dependent and have covariance matrix AA> instead of Im .
This can be remedied by adding a noise-correction term to the output: generate U ∼ N (0, Im ) and instead
output

    V′ = λ⁻¹ · AV₁ + (I_m − λ⁻² · AA^⊤)^{1/2} · U

where λ is an upper bound on the largest singular value of A and (I_m − λ⁻² · AA^⊤)^{1/2} is the positive
semidefinite square root of I_m − λ⁻² · AA^⊤. If V ∼ PB(n, i, p, q), it now follows that V′ is close in
total variation to N(µλ⁻¹ · A_i, I_m), where µ can be taken to be µ = Θ(1/√(log n)). This reduction also
preserves independent noise, mapping Bern(q)⊗n approximately to N (0, Im ).
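The following is a minimal sketch of the linear-algebraic core of dense Bernoulli rotations, assuming the entrywise rejection-kernel step has already been carried out; for illustration only, V1 is replaced by an exact Gaussian with the planted mean, whereas the actual reduction produces V1 via the Gaussian rejection kernels of Section 7.

```python
import numpy as np

def dense_bernoulli_rotation(V1, A, lam, rng=np.random.default_rng(0)):
    """Map V1 (approximately N(mu * e_i, I_n) or N(0, I_n)) to
    V' = lam^{-1} A V1 + (I_m - lam^{-2} A A^T)^{1/2} U with U ~ N(0, I_m),
    where lam upper bounds the largest singular value of A. The output is
    approximately N(mu * lam^{-1} * A_i, I_m) or N(0, I_m), respectively."""
    m = A.shape[0]
    M = np.eye(m) - A @ A.T / lam**2
    # Positive semidefinite square root of M via its eigendecomposition.
    w, Q = np.linalg.eigh(M)
    M_sqrt = Q @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ Q.T
    U = rng.standard_normal(m)
    return A @ V1 / lam + M_sqrt @ U

# Illustration with an orthonormal-row A (so lam = 1): plant coordinate i = 3.
rng = np.random.default_rng(1)
n, m, mu, i = 64, 32, 1.0, 3
A = np.linalg.qr(rng.standard_normal((n, m)))[0].T    # m x n with orthonormal rows
V1 = rng.standard_normal(n); V1[i] += mu              # stand-in for the rejection-kernel output
V_out = dense_bernoulli_rotation(V1, A, lam=1.0)      # approx N(mu * A[:, i], I_m)
```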
Dense Bernoulli rotations thus begin with a random vector of independent entries and one unknown
elevated bit and produce a vector with independent entries and an unknown elevated pattern from among
an arbitrary prescribed set A1 , A2 , . . . , An . Furthermore, the dependence of the signal strength µλ−1 in
the output instance V 0 on these A1 , A2 , . . . , An is entirely through the singular values of A. This yields
a general structure-transforming primitive that will be used throughout our reductions. Each such use will
consist of many local applications of dense Bernoulli rotations that will be stitched together to produce a
target distribution. These local applications will take three forms:

• To Rows Restricted to Column Parts: The adjacency matrix of k-BPC consists of k_n · k_m blocks, each
consisting of the edge indicators in Ei × Fj for each pair of the parts Ei , Fj from the given partitions
of [n] and [m]. In our reductions to robust sparse mean estimation, mixtures of SLRs, robust SLR and
universality for learning sparse mixtures, we apply dense Bernoulli rotations separately to each row
in each of these blocks.

• To Vectorized Adjacency Matrix Blocks: In our reductions to dense stochastic block models, testing
hidden partition models and semirandom single community detection, we first pre-process the adja-
cency matrix of k-PC with TO-k-PARTITE-SUBMATRIX. We then apply dense Bernoulli rotations
to R^{h²} vectorizations of each h × h block in this matrix corresponding to a pair of parts in the given
partition, i.e. of the form E_i × E_j.

• To Vectorized Adjacency Tensor Blocks: In our reduction to tensor PCA with order s, after com-
pleting the adjacency tensor of the input k-HPC_s instance, we apply dense Bernoulli rotations to R^{h^s}
vectorizations of each h × h × · · · × h block corresponding to an s-tuple of parts.

We remark that while dense Bernoulli rotations heavily rely on distributional properties of isotropic Gaussian
vectors, their implications extend far beyond statistical problems with Gaussian noise. Entrywise threshold-
ing produces planted graph problems and we will show that multiple thresholds followed by applying 3-ary
symmetric rejection kernels maps to a large universality class of noise distributions. These applications of
dense Bernoulli rotations generally reduce the problem of transforming hidden structure to a constrained
combinatorial construction problem – the task of designing a set of mean output vectors A1 , A2 , . . . , An
that have nearly orthogonal rows and match the combinatorial structure in the target statistical problem.

4.3 Design Matrices and Tensors
Design Matrices. To construct these vectors A1 , A2 , . . . , An for our applications of dense Bernoulli ro-
tations, we introduce several families of matrices based on the incidence geometry of finite fields. In our
reduction to robust sparse mean estimation, we will show that the adversary that corrupts an ε-fraction of the
samples by resampling them from N (−c · µ, Id ) produces the desired k-to-k 2 statistical-computational gap.
This same adversarial construction was used in [DKS17]. Here, µ ∈ Rd denotes the k-sparse mean of inter-
est. As will be further discussed at the beginning of Section 8, on applying dense Bernoulli rotations to rows
restricted to parts of the column partition, our desiderata for the mean vectors A_1, A_2, . . . , A_n
reduce to the following:
• A contains two distinct values {x, y}, and an ε′-fraction of each column is y where ε ≥ ε′ = Θ(ε);
• the rows of A are unit vectors and nearly orthogonal with λ = O(1); and
• A is nearly an isometry as a linear transformation from Rn → Rm .
The first criterion above is enough to ensure the correct distributional aesthetics and hidden structure in the
output of our reduction. The second and third criteria turn out to be necessary and sufficient for the reduction
to show tight computational lower bounds up to the conjectured barrier of n = õ(k²ε²/τ⁴). We remark that
the third criterion also is equivalent to m = Θ̃(n) given the second. Thus our task is to design nearly square,
nearly orthogonal matrices containing two distinct entries with an ε′-fraction of one present in each column.
Note that if ε = 1/2, this is exactly achieved by Hadamard matrices. For ε < 1/2, our desiderata are nearly
met by the following natural generalization of Hadamard matrices that we introduce. Note that the rows of
a Hadamard matrix can be generated as a reweighted incidence matrix between the hyperplanes and points
of F_2^t. Let r be a prime number with ε⁻¹ ≤ r = O(ε⁻¹) and consider the ℓ × r^t matrix A, where
ℓ = (r^t − 1)/(r − 1), with entries given by

    A_ij = 1/√(r^t (r − 1)) · { 1 if P_j ∉ V_i;  1 − r if P_j ∈ V_i }

where V_1, V_2, . . . , V_ℓ is an enumeration of the (t − 1)-dimensional subspaces of F_r^t and P_1, P_2, . . . , P_{r^t} is
an enumeration of the points in F_r^t. This construction nearly meets our three criteria, with one minor issue
that the column corresponding to 0 ∈ F_r^t only contains one of the two values. A more serious issue is that
ℓ = Θ(r^{t−1}) and A is far from an isometry if r ≫ 1, which leads to a suboptimal computational lower bound
for RSME. These issues are both remedied by adding in additional rows for all affine shifts of the hyperplanes
V_1, V_2, . . . , V_ℓ. The resulting matrix has dimensions rℓ × r^t and, although its rows are no longer orthogonal,
its largest singular value is √(1 + (r − 1)⁻¹). The resulting matrix K_{r,t} is used in our applications of dense
Bernoulli rotations to reduce to robust sparse mean estimation, mixtures of SLRs, robust SLR and to show
universality for learning sparse mixtures. Note that for any two rows ri and rj of Kr,t , the outer product
ri rj> is a zero-centered mean adjacency matrix of an imbalanced 2-block stochastic block model. This
observation suggests that the Kronecker product Kr,t ⊗ Kr,t can be used in dense Bernoulli rotations to
map to these SBMs. Surprisingly, this overall reduction yields tight computational lower bounds up to
the Kesten-Stigum threshold for dense SBMs, and using the matrix (K3,t ⊗ Is ) ⊗ (K3,t ⊗ Is ) yields tight
computational lower bounds for semirandom single community detection. We remark that, in this case, it is
again crucial that Kr,t is approximately square – if the matrix A defined above were used in place of Kr,t ,
our reduction would show a lower bound suboptimal to the Kesten-Stigum threshold by a factor of r. Our
reduction to order s tensor PCA applies dense Bernoulli rotations to vectorizations of each tensor block with
the sth order Kronecker product K2,t ⊗ K2,t ⊗ · · · ⊗ K2,t . We remark that these instances of K2,t in this
Kronecker product could be replaced by Hadamard matrices in dimension 2t .
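The following is a minimal sketch constructing the rℓ × r^t matrix K_{r,t} described above for a prime r and numerically verifying that its rows are unit vectors and that its largest singular value is √(1 + (r − 1)⁻¹); the function name is illustrative.

```python
import itertools
import numpy as np

def build_K(r, t):
    """Construct K_{r,t}: rows indexed by affine hyperplanes {x : <w, x> = a mod r}
    of F_r^t (w ranging over canonical nonzero directions, a over F_r), columns by
    the points of F_r^t, with entries (1 - r) on incidences and 1 elsewhere,
    scaled by 1 / sqrt(r^t (r - 1))."""
    points = np.array(list(itertools.product(range(r), repeat=t)))     # the r^t points
    directions = [w for w in points[1:]                                # one w per hyperplane
                  if w[np.nonzero(w)[0][0]] == 1]                      # first nonzero entry is 1
    rows = []
    for w in directions:
        vals = (points @ w) % r
        for a in range(r):
            rows.append(np.where(vals == a, 1 - r, 1))
    return np.array(rows, dtype=float) / np.sqrt(r**t * (r - 1))

K = build_K(r=3, t=3)                                                  # dimensions r*l x r^t = 39 x 27
print(np.allclose(np.linalg.norm(K, axis=1), 1.0))                     # rows are unit vectors
print(np.isclose(np.linalg.norm(K, 2), np.sqrt(1 + 1 / (3 - 1))))      # largest singular value
```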
In Section 8.4, we introduce a natural alternative to K_{r,t} – a random matrix R_{n,ε} that approximately
satisfies the three desiderata above. In our reductions to RSME and RSLR, this random matrix has the

advantage of eliminating the number-theoretic condition (T) arising from applying dense Bernoulli rotations
with K_{r,t}, which has nontrivial restrictions in the very small ε regime when ε = n^{−Ω(1)}. However, the
approximate properties of R_{n,ε} are insufficient to map exactly to our formulations of ISBM, SEMI-CR, GHPM
and BHPM, where the sizes of the hidden communities are known. A more detailed comparison of K_{r,t} and
R_{n,ε} can be found in Section 8.4. The random matrix R_{n,ε} is closely related to the adjacency matrices of
sparse random graphs, and establishing λ = O(1) requires results on their spectral concentration from the
literature. For a consistent and self-contained exposition, we present our reductions with K_{r,t}, which has a
comparatively simple analysis, and only outline extensions of our reductions using R_{n,ε}.

Design Tensors. Our final reduction using dense Bernoulli rotations is to testing hidden partition models.
This reduction requires a more involved construction for A that we only sketch here and defer a detailed
discussion to Section 8.3. Again applying dense Bernoulli rotations to vectorizations of each block of the
input k- PC instance, our goal is to construct a tensor Tr,t such that each slice has the same block structure
as an r-block SBM and the slices are approximately orthogonal under the matrix inner product. A natural
construction is as follows: index each slice by a pair of hyperplanes (Vi , Vj ), label the rows and columns of
each slice by Ftr and plant r communities on the entries with indices in (Vi + aui ) × (Vj + auj ) for each
a ∈ Fr . Here ui and uj are arbitrary vectors not in Vi and Vj , respectively, and thus Vi + aui ranges over
all affine shifts of Vi for a ∈ Fr . An appropriate choice of weights x and y on and off of these communities
yields slices that are exactly orthogonal.
However, this construction suffers from the same issue as the construction of A above – there are
O(r2t−2 ) slices each of which has r2t entries, making the matrix formed by vectorizing the slices of this
tensor far from square. This can be remedied by creating additional slices further indexed by a nonconstant
linear function L : Fr → Fr such that communities are now planted on (Vi + aui ) × (Vj + L(a) · uj ) for
each a ∈ F_r. There are r(r − 1) such linear functions L, making the vectorization of this tensor nearly
square. Furthermore, it is shown in Section 8.3 that this matrix has largest singular value √(1 + (r − 1)⁻¹).
We remark that this property is quite brittle, as substituting other families of bijections for L can cause this
largest singular value to increase dramatically. Taking the Kronecker product of each slice of this tensor Tr,t
with Is now yields the family of matrices used in our reduction to testing hidden partition models.
We remark that in all of these reductions with both design matrices and design tensors, dense Bernoulli
rotations are applied locally within the blocks induced by the partition accompanying the PCρ instance. In
all cases, our constructions ensure that the fact that the planted bits within these blocks take the form of a
submatrix is sufficient to stitch together the outputs of these local applications of dense Bernoulli rotations
into a single instance with the desired hidden structure. While we did not discuss this constraint in choosing
the design matrices A for each of our reductions, it will be a key consideration in the proofs throughout
this work. Surprisingly, the linear functions L in the construction of Tr,t directly lead to a community
alignment property proven in Section 8.3 that allow slices of this tensor to be consistently stitched together.
Furthermore, we note that unlike Kr,t , the tensor Tr,t does not seem to have a random matrix analogue that
is tractable to bound in spectral norm.

Parameter Correspondence with Dense Bernoulli Rotations. In several of our reductions using dense
Bernoulli rotations, a simple heuristic predicts our computational lower bound in the target problem. Let
X be a data tensor, normalized and centered so that each entry has mean zero and variance 1, and then
consider the `2 norm of the expected tensor E[X]. Our applications of rejection kernels typically preserve
this `2 norm up to polylog(n) factors. Since our design matrices are approximate isometries, most of our
applications of dense Bernoulli rotations also approximately preserve this `2 norm. Thus comparing the `2
norms of the input PCρ instance and output instance in our reductions yields a heuristic for predicting the
resulting computational lower bound. For example, our adversary in RSME produces a matrix E[X] ∈ Rd×n

consisting of columns of the form τ · k^{−1/2} · 1_S and −ε⁻¹(1 − ε)τ · k^{−1/2} · 1_S, up to constant factors,
where S is the hidden support of µ. The ℓ2 norm of this matrix is Θ(τ√(n/ε)). The ℓ2 norm of the matrix
E[X] corresponding to the starting k-BPC instance can be verified to be just below o(k^{1/2}n^{1/4}), when
the k-BPC instance is nearly at its computational barrier. Equating these two ℓ2 norms yields the relation
n = Θ(k²ε²/τ⁴), which is exactly our computational barrier for RSME.
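Written out, the RSME instance of this heuristic is the calculation

\[
\|\mathbb{E}[X]\|_2 = \Theta\Big(\sqrt{(1-\epsilon)\,n\,\tau^2 + \epsilon\, n\cdot \epsilon^{-2}(1-\epsilon)^2\tau^2}\Big) = \Theta\big(\tau\sqrt{n/\epsilon}\big), \qquad \tau\sqrt{n/\epsilon} = \Theta\big(k^{1/2}n^{1/4}\big) \iff n = \Theta\big(k^2\epsilon^2/\tau^4\big).
\]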
our computational barriers are produced for ISBM, GHPM, BHPM, SEMI - CR and TPCA at the beginnings of
Sections 14 and 15. We remark that for some of our problems with central steps other than dense Bernoulli
rotations, such as MSLR, RSLR and GLSM, this heuristic does not apply.

4.4 Decomposing Linear Regression and Label Generation


Our reductions to mixtures of SLRs and robust SLR in Section 10 are motivated by the following simple
initial observation. Suppose (X, y) is a single sample from unsigned SLR with y = γR · hv, Xi + N (0, 1)
where R ∈ {−1, 1} is a Rademacher random variable, v ∈ Rd is a k-sparse unit vector, X ∼ N (0, Id ) and
γ ∈ (0, 1). A standard conditioning property of Gaussian vectors yields that the conditional distribution of
X given R and y is another jointly Gaussian vector, as shown below. Our observation is that this conditional
distribution can be decomposed into a sum of our adversarial construction for robust sparse mean estimation
and an independent instance of negative sparse PCA. More formally, we have that

    X | R, y ∼ N( Rγy/(1 + γ²) · v,  I_d − γ²/(1 + γ²) · vv^⊤ )
             ∼ (1/√2) · N(Rτ · v, I_d) + (1/√2) · N(0, I_d − θvv^⊤)

where the first term on the second line is our RSME adversary with ε = 1/2 and the second term is negative
sparse PCA, with τ = τ(y) = (√2 γ/(1 + γ²)) · y and θ = 2γ²/(1 + γ²). Note that the marginal distribution of y is N(0, 1 + γ²) and

thus it typically holds that |y| = Θ(1). When this unsigned SLR instance is at its computational barrier of
n = Θ̃(k²/γ⁴) and |y| = Θ(1), then n = Θ̃(k²/τ⁴) and θ = Θ̃(√(k²/n)). Therefore, surprisingly, both of
the RSME and NEG - SPCA in the decomposition above are also at their computational barriers.
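The first line of the display above is the standard conditional distribution of jointly Gaussian vectors; written out, with y = γR⟨v, X⟩ + η, where η ∼ N(0, 1) is independent of X ∼ N(0, I_d) and v is a unit vector,

\[
\mathrm{Cov}(X, y \mid R) = \gamma R\, v, \qquad \mathrm{Var}(y) = 1 + \gamma^2,
\]
\[
\mathbb{E}[X \mid R, y] = \frac{\mathrm{Cov}(X, y \mid R)}{\mathrm{Var}(y)}\, y = \frac{R\gamma y}{1+\gamma^2}\, v, \qquad \mathrm{Cov}(X \mid R, y) = I_d - \frac{\gamma^2}{1+\gamma^2}\, v v^\top .
\]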
Now consider the task of instead reducing from k-BPC to the problem of estimating v from n independent
samples from the conditional distribution L(X| |y| = 1). In light of the observations above, it suffices to
first use Bernoulli cloning to produce two independent copies of k- BPC, reduce these two copies as outlined
below and then take the sum of the two outputs of these reductions.

• Producing Our RSME Adversary: One of the two copies of k-BPC should be mapped to a tight instance
of our adversarial construction for RSME with ε = 1/2 through local applications of dense Bernoulli
rotations with design matrix $K_{r,t}$ or $R_{n,\epsilon}$, as described previously.

• Producing NEG - SPCA: The other copy should be mapped to a tight instance of negative sparse PCA.
This requires producing negatively correlated data from positively correlated data, and will need new
techniques that we discuss next.

We remark that while these two output instances must be independent, it is important that they share the
same latent vector v. Bernoulli cloning ensures that the two independent copies of k-BPC have the same
clique vertices and thus the output instances have this desired property.
This reduction can be extended to reduce to the true joint distribution of (X, y) as follows. Consider
replacing each sample X1 of the output RSME instance by
$$X_2 = cy \cdot X_1 + \sqrt{1 - c^2 y^2} \cdot N(0, I_d)$$

where c is some scaling factor and y is independently sampled from N (0, 1 + γ 2 ), truncated to lie in the
interval [−T, T ] where cT ≤ 1. Observe that if X1 ∼ N (Rτ ·v, Id ), then X2 ∼ N (cRτ y·v, Id ) conditioned
on y. In Section 10.2, we show that a suitable choice of c and T, together with a slight adjustment of τ in the reduction above, tightly
maps to the desired distribution of mixtures of SLRs. Analogous observations, together with performing the RSME sub-
reduction with ε < 1/2, can be used to show tight computational lower bounds for robust SLR. We remark
that this produces a more complicated adversarial construction for robust SLR that may be of independent
interest. The details of this adversary can be found in Section 10.2.
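A minimal numerical sketch of this resampling step, assuming numpy; the function name is ours and the precise choices of c and T are made in Section 10.2 (the truncated Gaussian is sampled by simple rejection here for illustration):

```python
import numpy as np

def attach_label(X1, gamma, c, T, rng):
    """Replace an RSME output sample X1 (mean R*tau*v up to the adversary's
    corruptions) by X2 = c*y*X1 + sqrt(1 - c^2*y^2) * N(0, I_d), where y is
    drawn from N(0, 1 + gamma^2) truncated to [-T, T] with c*T <= 1."""
    assert c * T <= 1
    while True:  # rejection sampling from the truncated Gaussian
        y = rng.normal(0.0, np.sqrt(1.0 + gamma ** 2))
        if abs(y) <= T:
            break
    noise = rng.standard_normal(X1.shape[0])
    X2 = c * y * X1 + np.sqrt(1.0 - (c * y) ** 2) * noise
    return X2, y
```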

4.5 Producing Negative Correlations and Inverse Wishart Matrices


To complete our reductions to mixtures of SLRs and robust SLR, it suffices to give a tight reduction from
k-BPC to NEG-SPCA. Although NEG-SPCA and ordinary SPCA share the same conjectured computational
barrier at $\theta = \Theta(\sqrt{k^2/n})$ and can be solved by similar efficient algorithms above this barrier, as stochastic
models, the two are very different. As discussed in Section 3.5, ordinary SPCA admits a signal plus noise
representation while NEG - SPCA does not. This representation was crucially used in prior reductions show-
ing optimal computational lower bounds for SPCA in [BR13b, BR13a, WBS16, GMZ17, BBH18, BB19b].
Furthermore, the planted entries in a NEG - SPCA sample are negatively correlated. In contrast, the edge indi-
cators of PCρ are positively correlated and all prior reductions from PC have only produced hidden structure
that is also positively correlated.
We first simplify the task of reducing to NEG - SPCA with an observation used in the reduction to SPCA
in [BB19b]. Suppose that $n \ge m + 1$ and let m be such that $m/k^2$ tends slowly to infinity. If X is an $m \times n$
matrix with columns $X_1, X_2, \ldots, X_n \sim_{\text{i.i.d.}} N(0, \Sigma)$ where $\Sigma \in \mathbb{R}^{m \times m}$ is positive semidefinite, then the
conditional distribution of X given its rescaled empirical covariance matrix $\hat{\Sigma} = \sum_{i=1}^n X_i X_i^\top$ is $\hat{\Sigma}^{1/2} R$
where R is an independent $m \times n$ matrix sampled from Haar measure on the Stiefel manifold. This implies
that it suffices to reduce to $\hat{\Sigma}$ in the case where $\Sigma = I_d - \theta vv^\top$ in order to map to NEG-SPCA, as X can be
generated from $\hat{\Sigma}$ by randomly sampling this Haar measure. This measure can then be sampled efficiently
by applying Gram-Schmidt to the rows of an m × n matrix of independent standard Gaussians.
Let $W_m(n, \Sigma)$ be the law of $\hat{\Sigma}$, or in other words the Wishart distribution with covariance matrix $\Sigma$, and
let $W_m^{-1}(n, \Sigma)$ denote the distribution of its inverse. The matrices $W_m(n, \Sigma)$ and $W_m^{-1}(n, \beta \cdot \Sigma^{-1})$ where
$\beta^{-1} = n(n-m-1)$ have a number of common properties including close low-order moments. Furthermore,
if $\Sigma = I_d - \theta vv^\top$ then $\Sigma^{-1} = I_d + \theta' vv^\top$ where $\theta' = \frac{\theta}{1-\theta}$, which implies that $W_m^{-1}(n, \beta \cdot \Sigma^{-1})$ is a rescaling
of the inverse of the empirical covariance matrix of a set of samples from ordinary SPCA. This motivates
our main reduction to NEG-SPCA in Section 9.1, which roughly proceeds in the following two steps.
1. Begin with a small instance of BPC with $m = \omega(k^2)$ vertices on the left and n on the right. Apply
either the reduction of [BBH18] or [BB19b] to reduce to an ordinary SPCA instance $(X_1, X_2, \ldots, X_n)$
in dimension m with n samples and signal strength $\theta'$.

2. Form the rescaled empirical covariance matrix $\hat{\Sigma} = \sum_{i=1}^n X_i X_i^\top$ and
$$Y = \sqrt{n(n-m-1)} \cdot \hat{\Sigma}^{-1/2} R$$
where R is an independent sample from the Haar measure on the Stiefel manifold, as above. Output the columns of Y after padding them to be d-dimensional with i.i.d. N(0, 1) random variables (see the sketch below).
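A minimal numerical sketch of Step 2, assuming numpy; the helper names are ours and this is only an illustration of the two operations described above (the exact parameter choices and the total variation analysis are in Section 9.1):

```python
import numpy as np

def haar_stiefel_rows(m, n, rng):
    # Gram-Schmidt (via QR with a sign fix) applied to a Gaussian matrix,
    # giving an m x n matrix with orthonormal rows ~ Haar on the Stiefel manifold.
    G = rng.standard_normal((n, m))
    Q, R = np.linalg.qr(G)
    Q = Q * np.sign(np.diag(R))
    return Q.T

def neg_spca_from_spca_samples(X, d, rng):
    # Step 2 above: X is the m x n matrix of ordinary SPCA samples from Step 1.
    m, n = X.shape
    Sigma_hat = X @ X.T                                   # rescaled empirical covariance
    evals, evecs = np.linalg.eigh(Sigma_hat)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T   # PSD square root of Sigma_hat^{-1}
    Y = np.sqrt(n * (n - m - 1)) * inv_sqrt @ haar_stiefel_rows(m, n, rng)
    padding = rng.standard_normal((d - m, n))             # pad columns up to dimension d
    return np.vstack([Y, padding])
```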
The key detail in this reduction is that $\hat{\Sigma}^{1/2}$ in the process of regenerating X from $\hat{\Sigma}$ described above has been
replaced by the positive semidefinite square root $\hat{\Sigma}^{-1/2}$ of a rescaling of the empirical precision matrix. As
we will show in Section 9.1, establishing total variation guarantees for this reduction amounts to answering
the following nonasymptotic question from random matrix theory that may be of independent interest: when
do $W_m(n, \Sigma)$ and $W_m^{-1}(n, \beta \cdot \Sigma^{-1})$ converge in total variation for all positive semidefinite matrices Σ? A
simple reduction shows that the general case is equivalent to the isotropic case when $\Sigma = I_m$. In Section 9.2,
we answer this question, showing that these two matrices converge in KL divergence if and only if $n \gg m^3$.
Our result is of the same flavor as a number of recent results in random matrix theory showing convergence in
total variation between Wishart and GOE matrices [JL15, BDER16, BG16, RR19]. This condition amounts to
constraining our reduction to the low-sparsity regime $k \ll n^{1/6}$. As discussed in Section 3.5, this condition
does not affect the tightness of our lower bounds and seems to be an artefact of our techniques that possibly
can be removed.

4.6 Completing Tensors from Hypergraphs and Tensor PCA


As alluded to in the above discussion of rejection kernels, it is important that the entries in the vectors to
which we apply dense Bernoulli rotations are independent and that none of these entries is missing. In
the context of reductions beginning with k- PC, k- HPC, PC and HPC, establishing this entails pre-processing
steps to remove the symmetry of the input adjacency matrix and add in missing entries. As discussed in
Section 1.1 of [BBH19], these missing entries in the matrix case have led to technical complications in
the prior reductions in [HWX15, BBH18, BBH19, BB19b]. In reductions to tensor PCA, completing these
pre-processing steps in the tensor case seems unavoidable in order to produce the canonical formulation of
tensor PCA with a symmetric rank-1 spike v ⊗s as discussed in Section 3.8.
In order to motivate our discussion of the tensor case, we first consider the matrix case. Asymmetrizing
the adjacency matrix of an input PC instance can be achieved through a simple application of Bernoulli
cloning, but adding in the missing diagonal entries is more subtle. Note that the desired diagonal entries
contain nontrivial information about the vertices in the planted clique – they are constrained to be 1 along
the vertices of the clique and independent Bern(1/2) random variables elsewhere. This is roughly the infor-
mation gained on revealing a single vertex from the planted clique. In the matrix case, the following trick
effectively produces an instance of PC with the diagonal entries present. Add in 1’s along the entire diagonal
and randomly embed the resulting matrix as a principal minor in a larger matrix with off-diagonal entries
sampled from Bern(1/2) and on-diagonal entries sampled so that the total number of 1’s on the diagonal has
the correct binomial distribution. This trick appeared in the T O -S UBMATRIX procedure in [BBH19] for gen-
eral PDS instances, and is adapted in this work for k- PDS as the reduction T O -k-PARTITE -S UBMATRIX in
Section 7. This reduction is an important pre-processing step in mapping to dense stochastic block models,
testing hidden partition models and semirandom planted dense subgraph.
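A rough sketch of this diagonal-planting and embedding trick, assuming numpy; the function name is ours, and the max() fix-up below is a simplification — the actual procedures TO-SUBMATRIX and TO-k-PARTITE-SUBMATRIX sample the diagonal count so that the resulting distribution is exact:

```python
import numpy as np

def embed_with_planted_diagonal(A, n, rng):
    # A: N x N {0,1} matrix with a missing/meaningless diagonal (e.g. an
    # asymmetrized PC adjacency matrix).  Plant 1s on its diagonal, embed it as
    # a random principal minor of an n x n matrix with Bern(1/2) off-diagonal
    # entries, and fill the remaining diagonal so that the total number of
    # diagonal 1s is (approximately) Bin(n, 1/2).
    N = A.shape[0]
    M = rng.binomial(1, 0.5, size=(n, n))
    idx = np.sort(rng.choice(n, size=N, replace=False))   # embedding positions
    M[np.ix_(idx, idx)] = A
    M[idx, idx] = 1                                        # planted diagonal
    rest = np.setdiff1d(np.arange(n), idx)
    M[rest, rest] = 0
    total_ones = max(rng.binomial(n, 0.5), N)              # crude; see caveat above
    extra = rng.choice(rest, size=total_ones - N, replace=False)
    M[extra, extra] = 1
    return M
```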
The tensor case is not as simple as the matrix case. While asymmetrizing can be handled similarly
with Bernoulli cloning, the missing entries of the adjacency tensor of HPC are now more numerous and
correspond to any entry with two equal indices. Unlike in the matrix case, the information content in these
entries alone is enough to solve HPC. For example, in 3-uniform HPC, the missing set of entries (i, i, j)
should have the same distribution as the completed adjacency matrix of an entire instance of planted clique
with the same hidden clique vertices. Thus a reduction that randomly generates these missing entries as in
the matrix case is no longer possible without knowing the solution to the input HPC instance. However, if
an oracle were to have revealed a single vertex of the hidden clique, we would be able to use the hyperedges
containing this vertex to complete the missing entries of the adjacency tensor. In general, given an HPC
instance of arbitrary order s, a more involved cloning and embedding procedure detailed in Section 11
completes the missing entries of the adjacency tensor given oracle access to s − 1 vertices of the hidden
clique. Our reduction to tensor PCA in Sections 11 and 15 iterates over all (s − 1)-tuples of vertices in
the input HPC instance, uses this procedure to complete the missing entries of the adjacency tensor, applies
dense Bernoulli rotations as described previously and then feeds the output instance to a blackbox solving
tensor PCA. The reduction only succeeds in mapping to the correct distribution on tensor PCA in iterations
that successfully guess s−1 vertices of the planted clique. However, we show that this is sufficient to deduce
tight computational lower bounds for tensor PCA. We remark that this reduction is the first reduction in total
variation from PCρ that seems to require multiple calls to a blackbox solving the target problem.

4.7 Symmetric 3-ary Rejection Kernels and Universality
So far, all of our reductions have been to problems with Gaussian or Bernoulli data and our techniques have
often relied heavily on the properties of jointly Gaussian vectors. Our last reduction technique shows that
the consequences of these reductions extend far beyond Gaussian and Bernoulli problems. We introduce a
new rejection kernel in Section 7.3 and show in Section 16 that, when applied entrywise to the output of our
reduction to RSME when ε = 1/2, this rejection kernel yields a universal computational lower bound for a
general variant of learning sparse mixtures with nearly arbitrary marginals.
Because sparse mixture models necessarily involve at least three distinct marginal distributions, a deficit
in degrees of freedom implies that the existing framework for rejection kernels with binary entries cannot
yield nontrivial hardness. We resolve this issue by considering rejection kernels with a slightly larger input
space, and introduce a general framework for 3-ary rejection kernels with entries in {−1, 0, 1} in Section
7.3. We show in Section 16 that first mapping each entry of our RSME instance with ε = 1/2 into {−1, 0, 1}
by thresholding at intervals of the form (−∞, −T], (−T, T) and [T, ∞) with T = Θ(1) and then applying
3-ary rejection kernels entrywise is a nearly lossless reduction. In particular, it yields new computational
lower bounds for a wide universality class that tightly recover optimal computational lower bounds for
sparse PCA, learning mixtures of exponentially distributed data, the original RSME instance with ε = 1/2
and many other sparse mixture formulations. The implications of this reduction are discussed in detail in
Section 16.2.
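A minimal sketch of the entrywise thresholding step described above, assuming numpy (the 3-ary rejection kernel applied afterwards is specified in Section 7.3):

```python
import numpy as np

def threshold_to_ternary(X, T):
    # Map each real entry into {-1, 0, 1} using the intervals
    # (-inf, -T], (-T, T) and [T, inf) with T = Theta(1).
    return np.where(X >= T, 1, np.where(X <= -T, -1, 0))
```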

4.8 Encoding Cliques as Structural Priors


As discussed in Section 1.2, reductions from PCρ showing tight computational lower bounds cannot generate
a non-negligible part of the hidden structure in the target problem themselves, but instead must encode
the hidden clique of the input instance into this structure. In this section, we outline how our reductions
implicitly encode hidden cliques. Note that the hidden subset of vertices corresponding to a clique in PCρ
has Θ(k log n) bits of entropy while the distribution over the hidden structure in the target problems that
we consider can have much higher entropy. For example, the Rademacher prior on the planted vector v in
Tensor PCA has n bits of entropy and the distribution over hidden partitions in testing partition models has
entropy $\Theta(r^2 K^2 \log n \log r)$.
Although our reductions inject randomness to produce the desired noise distributions of target problems,
the induced maps encoding the clique in PCρ as a new hidden structure typically do not inject randomness.
Consequently, our reductions generally show hardness for priors over the hidden structure in our target
problems with entropy Θ(k log n). This then implies a lower bound for our target problems, because the
canonical uniform priors with which they are defined are the hardest priors. For example, every instance
of PCρ reduces to the uniform prior over cliques as in PC by randomly relabelling nodes. Similarly, a tensor
PCA instance with a fixed planted vector v reduces to the formulation in which v is uniformly distributed
on {−1, 1}n by taking the entrywise product of the tensor PCA instance with u⊗s where u is chosen u.a.r.
from {−1, 1}n . Thus our reductions actually show slightly stronger computational lower bounds than those
stated in our main theorems – they show lower bounds for our target problems with nonuniform priors on
their hidden structures. These nonuniform priors arise from the encodings of planted cliques into target
hidden structure implicitly in our reductions, several of which we summarize below. Our reductions often
involve aesthetic pre-processing and post-processing steps to reduce to canonical uniform priors and often
subsample the output instance. To simplify our discussion, we omit these steps in describing the clique
encodings induced by our reductions.

• Robust Sparse Mean Estimation and SLR: Let $S_L$ and $S_R$ be the sets of left and right clique vertices
of the input k-BPC instance and let $[N] = E_1 \cup E_2 \cup \cdots \cup E_{k_N}$ be the given partition of the right
vertices. The support of the k-sparse vector in our output RSME and RSLR instances is simply $S_L$. Let
r be a prime and let $E_1' \cup E_2' \cup \cdots \cup E_{k_N}'$ be a partition of the output n samples into parts of size
$r\ell$ where $\ell = \frac{r^t - 1}{r - 1}$. Label each element of $E_i'$ with an affine shift of a hyperplane in $\mathbb{F}_r^t$ and each
element of $E_i$ with a point of $\mathbb{F}_r^t$. For each i, our adversary corrupts each sample in $E_i'$ corresponding
to an affine shift of a hyperplane containing the point corresponding to the unique element in $S_R \cap E_i$.

• Dense Stochastic Block Models: Let S be the set of clique vertices of the input k-PC instance and
let E be the given partition of its vertices [N]. Let $E'$ be a partition of the output n vertices
again into parts of size $r\ell$. Label elements in each part as above. Our output ISBM instance has its
smaller community supported on the union of the vertices across all $E_i'$ corresponding to affine shifts
containing the points in $\mathbb{F}_r^t$ corresponding to the vertices in S.

• Mixtures of SLRs and Generalized Learning Sparse Mixtures: Let $S_L$, $S_R$, $k$, $k_N$, $N$, $n$ and $E$ be
as above. The support of the k-sparse vector in our output MSLR and GLSM instances is again simply
$S_L$. Let $H_1, H_2, \ldots, H_{2^t - 1} \in \{-1, 1\}^{2^t}$ be the zero-sum rows of a Hadamard matrix and let $E'$ be a
partition of the output n samples into $k_N$ blocks of size $2^t$. The output instance sets the jth sample in
$E_i'$ to be from the first part of the mixture if and only if the jth entry of $H_s$ is 1, where s is the unique
element in $S_R \cap E_i$. In other words, the mixture pattern along $E_i'$ is given by the $(S_R \cap E_i)$th row of
a Hadamard matrix, as sketched at the end of this subsection.

• Tensor PCA: Let S be the set of clique vertices of the input k- HPC instance and let E and N be
as above. Similarly to MSLR and GLSM, the planted vector v of our output TPCA instance is the
concatenation of the (S ∩ Ei )th rows of a Hadamard matrix.

Our reduction to testing hidden partition models induces a more intricate encoding of cliques similar to that
of dense stochastic block models described above. We remark that each of these encodings arises directly
from design matrices and tensors based on Kr,t used in the dense Bernoulli rotation step of our reductions.
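To illustrate the Hadamard encoding in the mixtures of SLRs and GLSM bullet above, here is a small sketch, assuming scipy; the function name is ours:

```python
import numpy as np
from scipy.linalg import hadamard

def mixture_pattern(t, s):
    """Return a boolean vector of length 2^t indicating which samples in a
    block E'_i come from the first mixture component, when the unique clique
    element of E_i is labelled by s in {1, ..., 2^t - 1}."""
    H = hadamard(2 ** t)        # H[0] is the all-ones row; H[1:] are zero-sum
    assert 1 <= s <= 2 ** t - 1
    return H[s] == 1
```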

5 Further Directions and Open Problems


In this section, we describe several further directions and problems left open in this work. These directions
mainly concern the PCρ conjecture and our reduction techniques.

Further Evidence for PCρ Conjectures. In this work, we give evidence for the PCρ conjecture from the
failure of low-degree polynomials and for specific instantiations of the PCρ conjecture from the failure of
SQ algorithms. An interesting direction for future work is to show sum of squares lower bounds for PCρ
and k- HPCs supporting this conjecture. A priori, this seems to be a technically difficult task as the sum of
squares lower bounds in [BHK+ 16] only apply to the prior in planted clique where every vertex is included
in the clique independently with probability k/n. Thus it even remains open to extend these lower bounds
to the uniform prior over k-subsets of [n].

How do Priors on Hidden Structure Affect Hardness? In this work, we showed that slightly altering the
prior over the hidden structure of PC gave rise to a problem much more amenable to average-case reductions.
This raises a broad question: for general problems P with hidden structure, how does changing the prior
over this hidden structure affect its hardness? In other words, for natural problems other than PC, how does
the conjectured computational barrier change with ρ? Another related direction for future work is whether
other choices of ρ in the PCρ conjecture give meaningful assumptions that can be mapped to more natural
problems than the ones we consider here. Furthermore, it would be interesting to study how reductions carry
ensembles of problems with a general prior ρ to one another. For instance, is there a reduction between PC

and another problem, such as SPCA, such that every hard prior in PCρ is mapped to a corresponding hard
prior in SPCA?

Generalizations of Dense Bernoulli Rotations. In this work, dense Bernoulli rotations were an extremely
important subroutine, serving as our simplest primitive for transforming hidden structure. An interesting
technical direction for future work is to find similar transformations mapping to other distributions. More
concretely, dense Bernoulli rotations approximately mapped from PB(n, i, 1, 1/2) to the n distributions
Di = N (c · Ai , Im ), respectively, and mapped from Bern(1/2)⊗m to D = N (0, Im ). Are there other sim-
ilar reductions mapping from these planted bit distributions to different ensembles of D, D1 , D2 , . . . , Dn ?
Furthermore, can these maps be used to show tight computational lower bounds for natural problems? For
example, two possibly interesting ensembles of D, D1 , D2 , . . . , Dn are:

1. $D_i = \bigotimes_{j=1}^{m} \mathrm{Bern}(P_{ij} n^{-\alpha})$ and some $D$, where $P \in [0,1]^{n \times m}$ is a fixed matrix of constants and $\alpha > 0$.

2. $D_i = N(c \cdot A_i, I_m - c^2 A_i A_i^\top)$ and $D = N(0, I_m)$.

The first example above corresponds to whether or not there is a sparse analogue of Bernoulli rotations that
can be used to show tight computational lower bounds. A natural approach to (1) is to apply dense Bernoulli
rotations and map each entry into {0, 1} by thresholding at some large real number $T = \Theta(\sqrt{\log n})$. While
this maps to an ensemble of the form in (1), this reduction seems lossy, in the sense that it discards signal in
the input instance, and it does not appear to show tight computational lower bounds for any natural problem.
The second example above presents a set of Di with the same expected covariance matrices as D. Note
that in ordinary dense Bernoulli rotations the expected covariance matrices for each i are $I_m + c^2 \cdot A_i A_i^\top$
and often a degree-2 polynomial suffices to distinguish them from D. More generally, a natural question is:
are there analogues of dense Bernoulli rotations that are tight to algorithms given by polynomials of degree
higher than 2?

General Reductions to Supervised Problems. Our last open problem is more concrete than the previous
two. In our reductions to MSLR and RSLR, we crucially use a subroutine mapping to NEG - SPCA. This
subroutine requires that $k = \tilde{o}(n^{1/6})$ in order to show convergence in KL divergence between the Wishart
and inverse Wishart distributions. Is there a reduction that relaxes this requirement to $k = \tilde{o}(n^{\alpha})$ where
1/6 < α < 1/2? Providing a reduction for α arbitrarily close to 1/2 would essentially fill out all parameter
regimes of interest in our computational lower bounds for MSLR and RSLR. Any reduction relaxing this
constraint to some α with α > 1/6 seems as though it would require new techniques and be technically
interesting. Another question related to our reductions to MSLR and RSLR is: can our label generation
technique be generalized to handle more general link functions σ, i.e., generalized linear models where each
sample-label pair (X, y) satisfies $y = \sigma(\langle \beta, X \rangle) + N(0, 1)$? In particular, is there a reduction mapping to
the canonical formulation of sparse phase retrieval with $\sigma(t) = t^2$? Although the statistical-computational
gap for this formulation of sparse phase retrieval seems closely related to our computational lower bound
for MSLR, any such reduction seems as though it would be interesting from a technical viewpoint.

Part II
Average-Case Reduction Techniques
6 Preliminaries and Problem Formulations
In this section, we establish notation and some preliminary observations for proving our main theorems
from Section 3. We already defined our notion of computational lower bounds and solving detection and
recovery problems in Section 3. In this section, we begin by stating our conventions for detection problems
and adversaries. In Section 6.2, we introduce the framework for reductions in total variation to show compu-
tational lower bounds for detection problems. In Section 6.3, we then state detection formulations for each
of our problems of interest that it will suffice to exhibit reductions to. Finally, in Section 6.4, we introduce
the key notation that will be used throughout the paper. Later in Section 17, we discuss how our reductions
and lower bounds for the detection formulations in Section 6.3 imply lower bounds for natural estimation
and recovery variants of our problems.

6.1 Conventions for Detection Problems and Adversaries


We begin by describing our general setup for detection problems and the notions of robustness and types of
adversaries that we consider.

Detection Problems. In a detection task P, the algorithm is given a set of observations and tasked with
distinguishing between two hypotheses:

• a uniform hypothesis H0 corresponding to the natural noise distribution for the problem; and

• a planted hypothesis H1 , under which observations are generated from this distribution but with a
latent planted structure.

Both H0 and H1 can either be simple hypotheses consisting of a single distribution or composite hypotheses
consisting of multiple distributions. Our problems typically are such that either: (1) both H0 and H1 are
simple hypotheses; or (2) both H0 and H1 are composite hypotheses consisting of the set of distributions
that can be induced by some constrained adversary.
As discussed in [BBH18] and [HWX15], when detection problems need not be composite by defini-
tion, average-case reductions to natural simple vs. simple hypothesis testing formulations are stronger and
technically more difficult. In these cases, composite hypotheses typically arise because a reduction gadget
precludes mapping to the natural simple vs. simple hypothesis testing formulation. We remark that simple
vs. simple formulations are the hypothesis testing problems that correspond to average-case decision prob-
lems (L, D) as in Levin’s theory of average-case complexity. A survey of average-case complexity can be
found in [BT+ 06b].

Adversaries. The robust estimation literature contains a number of adversaries capturing different notions
of model misspecification. We consider the following three central classes of adversaries:

1. ε-corruption: A set of samples $(X_1, X_2, \ldots, X_n)$ is an ε-corrupted sample from a distribution D if
they can be generated by giving a set of n samples drawn i.i.d. from D to an adversary who then
changes at most εn of them arbitrarily.

2. Huber’s contamination model: A set of samples $(X_1, X_2, \ldots, X_n)$ is an ε-contamination of D in
Huber’s model if
$$X_1, X_2, \ldots, X_n \sim_{\text{i.i.d.}} \mathrm{MIX}_\epsilon(D, D_O)$$
where $D_O$ is an unknown outlier distribution chosen by an adversary. Here, $\mathrm{MIX}_\epsilon(D, D_O)$ denotes the
ε-mixture distribution formed by sampling D with probability $(1-\epsilon)$ and $D_O$ with probability ε.

3. Semirandom adversaries: Suppose that D is a distribution over collections of observations {Xi }i∈I
such that an unknown subset P ⊆ I of indices correspond to a planted structure. A sample {Xi }i∈I
is semirandom if it can be generated by giving a sample from D to an adversary who is allowed to
decrease Xi for any i ∈ I\P. Some formulations of semirandom adversaries in the literature also
permit increases in Xi for any i ∈ P . Our lower bounds apply to both adversarial setups.

All adversaries in these models of robustness are computationally unbounded and have access to randomness
– meaning that they also have access to any hidden structure in a problem that can be recovered information
theoretically. Given a single distribution D over a set X, any one of these three adversaries produces a set
of distributions ADV(D) that can be obtained after corruption. When formulated as detection problems, the
hypotheses H0 and H1 are of the form ADV(D) for some D. We remark that ε-corruption can simulate
contamination in Huber’s model at a slightly smaller $\epsilon'$ within o(1) total variation. This is because a sam-
ple from Huber’s model has $\mathrm{Bin}(n, \epsilon')$ samples from $D_O$. An adversary resampling $\min\{\mathrm{Bin}(n, \epsilon'), \epsilon n\}$
samples from $D_O$ therefore simulates Huber’s model within a total variation distance bounded by standard
concentration for the Binomial distribution.

6.2 Reductions in Total Variation and Computational Lower Bounds


In this section, we introduce our framework for reductions in total variation, state a general condition for
deducing computational lower bounds from reductions in total variation and state a number of properties of
total variation that we will use in analyzing our reductions.

Average-Case Reductions in Total Variation. We give approximate reductions in total variation to show
that lower bounds for one hypothesis testing problem imply lower bounds for another. These reductions yield
an exact correspondence between the asymptotic Type I+II errors of the two problems. This is formalized
in the following lemma, which is Lemma 3.1 from [BBH18] stated in terms of composite hypotheses H0
and H1 . The main quantity in the statement of the lemma can be interpreted as the smallest total variation
distance between the reduced object A(X) and the closest mixture of distributions from either $H_0'$ or $H_1'$.
The proof of this lemma is short and follows from the definition of total variation. Given a hypothesis Hi ,
we let ∆(Hi ) denote the set of all priors over the set of distributions valid under Hi .

Lemma 6.1 (Lemma 3.1 in [BBH18]). Let P and P′ be detection problems with hypotheses $H_0, H_1$ and
$H_0', H_1'$, respectively. Let X be an instance of P and let Y be an instance of P′. Suppose there is a
polynomial time computable map A satisfying
$$\sup_{P \in H_0} \inf_{\pi \in \Delta(H_0')} d_{TV}\left( \mathcal{L}_P(A(X)), \ \mathbb{E}_{P' \sim \pi} \mathcal{L}_{P'}(Y) \right) + \sup_{P \in H_1} \inf_{\pi \in \Delta(H_1')} d_{TV}\left( \mathcal{L}_P(A(X)), \ \mathbb{E}_{P' \sim \pi} \mathcal{L}_{P'}(Y) \right) \le \delta$$
If there is a randomized polynomial time algorithm solving P′ with Type I+II error at most ε, then there is
a randomized polynomial time algorithm solving P with Type I+II error at most ε + δ.

If δ = o(1), then given a blackbox solver B for $P_D'$, the algorithm that applies A and then B solves $P_D$
and requires only a single query to the blackbox. We now outline the computational model and conventions
we adopt throughout this paper. An algorithm that runs in randomized polynomial time refers to one that has

access to poly(n) independent random bits and must run in poly(n) time where n is the size of the instance
of the problem. For clarity of exposition, in our reductions we assume that explicit real-valued expressions
can be exactly computed and that we can sample a biased random bit Bern(p) in polynomial time. We also
assume that the sampling and density oracles described in Definition 7.6 can be computed in poly(n) time.
For simplicity of exposition, we assume that we can sample N (0, 1) in poly(n) time.

Deducing Strong Computational Lower Bounds for Detection from Reductions. Throughout Part III,
we will use the guarantees for our reductions to show computational lower bounds. For clarity and to avoid
redundancy, we will outline a general recipe for showing these hardness results. All lower bounds that will
be shown in Part III are computational lower bounds in the sense introduced in the beginning of Section 6.1.
Consider a problem P with parameters (n, a1 , a2 , . . . , at ) and hypotheses H0 and H1 with a conjectured
computationally hard regime captured by the constraint set C. In order to show a computational lower
bound at C based on one of our hardness assumptions, it suffices to show that the following is true:

Condition 6.1 (Computational Lower Bounds from Reductions). For all sequences of parameters satisfying
the lower bound constraints $\{(n, a_1(n), a_2(n), \ldots, a_t(n))\}_{n=1}^\infty \subseteq C$, there are:

1. another sequence of parameters $\{(n_i, a_1'(n_i), a_2'(n_i), \ldots, a_t'(n_i))\}_{i=1}^\infty \subseteq C$ such that
$$\lim_{i \to \infty} \frac{\log a_k'(n_i)}{\log a_k(n_i)} = 1$$

2. a sequence of instances $\{G_i\}_{i=1}^\infty$ of a problem PCρ with hypotheses $H_0'$ and $H_1'$ that cannot be solved
in polynomial time according to Conjecture 2.3; and

3. a polynomial time reduction R such that if $P(n_i, a_1'(n_i), a_2'(n_i), \ldots, a_t'(n_i))$ has an instance denoted
by $X_i$, then
$$d_{TV}\left( R(G_i|H_0'), \mathcal{L}(X_i|H_0) \right) = o_{n_i}(1) \quad\text{and}\quad d_{TV}\left( R(G_i|H_1'), \mathcal{L}(X_i|H_1) \right) = o_{n_i}(1)$$

This can be seen to suffice as follows. Suppose that A solves P for some possible growth rate in C, i.e.,
there is a sequence $\{(n_i, a_1'(n_i), a_2'(n_i), \ldots, a_t'(n_i))\}_{i=1}^\infty \subseteq C$ with this growth rate such that A has Type
I+II error $1 - \Omega_{n_i}(1)$ on $P(n_i, a_1'(n_i), a_2'(n_i), \ldots, a_t'(n_i))$. By Lemma 6.1, it follows that $A \circ R$ also has
Type I+II error $1 - \Omega_{n_i}(1)$ on the sequence of inputs $\{G_i\}_{i=1}^\infty$, which contradicts the conjecture that they
are hard instances. The three conditions above will be verified in a number of theorems in Part III.

Remarks on Deducing Computational Lower Bounds. We make several important remarks on the
recipe outlined above. In all of our applications of Condition 6.1, the second sequence of parameters
$(n_i, a_1'(n_i), a_2'(n_i), \ldots, a_t'(n_i))$ will either be exactly a subsequence of the original parameter sequence
$(n, a_1(n), a_2(n), \ldots, a_t(n))$ or will have one parameter $a_i' \neq a_i$ different from the original. However, the
ability to pass to a subsequence will be crucial in a number of cases where number-theoretic constraints
on parameters impact the tightness of our computational lower bounds. These constraints will arise in our
reductions to robust sparse mean estimation, robust SLR and dense stochastic block models. They are
discussed more in Section 13.

Properties of Total Variation. The analysis of our reductions will make use of the following well-known
facts and inequalities concerning total variation distance.

Fact 6.2. The distance dTV satisfies the following properties:

1. (Tensorization) Let $P_1, P_2, \ldots, P_n$ and $Q_1, Q_2, \ldots, Q_n$ be distributions on a measurable space $(\mathcal{X}, \mathcal{B})$.
Then
$$d_{TV}\left( \prod_{i=1}^n P_i, \ \prod_{i=1}^n Q_i \right) \le \sum_{i=1}^n d_{TV}(P_i, Q_i)$$

2. (Conditioning on an Event) For any distribution P on a measurable space (X , B) and event A ∈ B,


it holds that
dTV (P (·|A), P ) = 1 − P (A)

3. (Conditioning on a Random Variable) For any two pairs of random variables $(X, Y)$ and $(X', Y')$
each taking values in a measurable space $(\mathcal{X}, \mathcal{B})$, it holds that
$$d_{TV}\left( \mathcal{L}(X), \mathcal{L}(X') \right) \le d_{TV}\left( \mathcal{L}(Y), \mathcal{L}(Y') \right) + \mathbb{E}_{y \sim Y}\left[ d_{TV}\left( \mathcal{L}(X|Y=y), \mathcal{L}(X'|Y'=y) \right) \right]$$
where we define $d_{TV}(\mathcal{L}(X|Y=y), \mathcal{L}(X'|Y'=y)) = 1$ for all $y \notin \mathrm{supp}(Y')$.

Given an algorithm A and distribution P on inputs, let A(P) denote the distribution of A(X) induced
by X ∼ P. If A has k steps, let Ai denote the ith step of A and Ai-j denote the procedure formed by steps
i through j. Each time this notation is used, we clarify the intended initial and final variables when Ai and
Ai-j are viewed as Markov kernels. The next lemma from [BBH19] encapsulates the structure of all of our
analyses of average-case reductions. Its proof is simple and included in Appendix A.1 for completeness.

Lemma 6.3 (Lemma 4.2 in [BBH19]). Let A be an algorithm that can be written as $A = A_m \circ A_{m-1} \circ
\cdots \circ A_1$ for a sequence of steps $A_1, A_2, \ldots, A_m$. Suppose that the probability distributions $P_0, P_1, \ldots, P_m$
are such that $d_{TV}(A_i(P_{i-1}), P_i) \le \epsilon_i$ for each $1 \le i \le m$. Then it follows that
$$d_{TV}(A(P_0), P_m) \le \sum_{i=1}^m \epsilon_i$$

The next lemma bounds the total variation between unplanted and planted samples from binomial dis-
tributions. This will serve as a key computation in the proof of correctness for the reduction primitive
T O -k-PARTITE -S UBMATRIX. We remark that the total variation upper bound in this lemma is tight in the
following sense. When all of the Pi are the same, the expected value of the sum of the coordinates of the
first distribution is $k(P_i - Q)$ higher than that of the second. The standard deviation of the second sum is
$\sqrt{kmQ(1-Q)}$ and thus when $k(P_i - Q)^2 \gg mQ(1-Q)$, the total variation below tends to one. The
proof of this lemma can be found in Appendix A.1.

Lemma 6.4. If $k, m \in \mathbb{N}$, $P_1, P_2, \ldots, P_k \in [0, 1]$ and $Q \in (0, 1)$, then
$$d_{TV}\left( \bigotimes_{i=1}^k \left( \mathrm{Bern}(P_i) + \mathrm{Bin}(m-1, Q) \right), \ \mathrm{Bin}(m, Q)^{\otimes k} \right) \le \sqrt{ \sum_{i=1}^k \frac{(P_i - Q)^2}{2mQ(1-Q)} }$$

Here, L1 + L2 denotes the convolution of two given probability measures L1 and L2 . The next lemma
bounds the total variation between two binomial distributions. Its proof can be found in Appendix A.1.

Lemma 6.5. Given $P \in [0, 1]$, $Q \in (0, 1)$ and $n \in \mathbb{N}$, it follows that
$$d_{TV}\left( \mathrm{Bin}(n, P), \mathrm{Bin}(n, Q) \right) \le |P - Q| \cdot \sqrt{\frac{n}{2Q(1-Q)}}$$

6.3 Problem Formulations as Detection Tasks
In this section, we formulate each problem for which we will show computational lower bounds as a detec-
tion problem. More precisely, for each problem P introduced in Section 3, we introduce a detection variant
P 0 such that a blackbox for P also solves P 0 . Some of these formulations were already implicitly intro-
duced or will be reintroduced in future sections. We gather all of these formulations here for convenience.
Throughout this work, to simplify notation, we will refer to problems P and their detection formulations
P 0 introduced in this section using the same notation. Furthermore, we will often denote the distribution
over instances under the alternative hypothesis H1 of the detection formulation for P with the notation PD ,
when H1 is a simple hypothesis. We will also often parameterize PD by θ to denote PD conditioned on the
latent hidden structure θ. When H1 is composite, PD denotes the set of distributions permitted under H1 .
These general conventions are introduced on a per problem basis in this section. In Section 17, we show that
our reductions and lower bounds for these detection formulations also imply lower bounds for analogous
estimation and recovery variants.

Robust Sparse Mean Estimation. Our hypothesis testing formulation for the problem RSME(n, k, d, τ, ε)
has hypotheses given by

$$H_0: (X_1, X_2, \ldots, X_n) \sim_{\text{i.i.d.}} N(0, I_d)$$

$$H_1: (X_1, X_2, \ldots, X_n) \sim_{\text{i.i.d.}} \mathrm{MIX}_\epsilon\left( N(\tau \cdot \mu_R, I_d), D_O \right)$$

where $D_O$ is any adversarially chosen outlier distribution on $\mathbb{R}^d$, where $\mu_R \in \mathbb{R}^d$ is a random k-sparse
unit vector chosen uniformly at random from all such vectors with entries in $\{0, 1/\sqrt{k}\}$. Note that H1 is
a composite hypothesis here since $D_O$ is arbitrary. Note also that this is a formulation of RSME in Huber’s
contamination model, and therefore lower bounds for this detection problem imply corresponding lower
bounds under stronger ε-corruption adversaries.
As discussed in Section 3.1, RSME is only information-theoretically feasible when τ = Ω(ε). Consider
any algorithm that produces some estimate $\hat{\mu}$ satisfying $\|\hat{\mu} - \mu\|_2 < \tau/2$ with probability 1/2 + Ω(1)
in the estimation formulation for RSME with hidden k-sparse vector µ, as described in Section 3.1. This
algorithm would necessarily output some $\hat{\mu}$ with $\|\hat{\mu}\|_2 < \tau/2$ under H0 and some $\hat{\mu}$ with $\|\hat{\mu}\|_2 > \tau/2$ under
H1 with probability 1/2 + Ω(1) in the hypothesis testing formulation above, thus solving it in the sense of
Section 3. Thus any computational lower bound for this hypothesis testing formulation also implies a lower
bound for the typical estimation formulation of RSME.
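For concreteness, the following is a sampler for an instance under H1, assuming numpy; the particular outlier distribution $D_O$ used here mirrors the mean-shifted adversary described in Section 4 and is only one illustrative choice, since $D_O$ is arbitrary in the formulation above:

```python
import numpy as np

def sample_rsme_h1(n, k, d, tau, eps, rng):
    # Hidden k-sparse direction with entries in {0, 1/sqrt(k)}.
    support = rng.choice(d, size=k, replace=False)
    mu = np.zeros(d)
    mu[support] = 1.0 / np.sqrt(k)
    X = rng.standard_normal((n, d))
    outlier = rng.random(n) < eps                   # Huber eps-mixture
    X[~outlier] += tau * mu                         # inliers: N(tau * mu, I_d)
    X[outlier] += -((1 - eps) / eps) * tau * mu     # outliers: adversary's mean shift
    return X, support
```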

Dense Stochastic Block Models. Given a subset C1 ⊆ [n] of size n/k, let ISBMD (n, C1 , P11 , P12 , P22 )
denote the distribution on n-vertex graphs G0 introduced in Section 3.2 conditioned on C1 . Furthermore, let
ISBM D (n, k, P11 , P12 , P22 ) denote the mixture of these distributions induced by choosing C1 uniformly at
random from the (n/k)-subsets of [n]. The problem ISBM(n, k, P11 , P12 , P22 ) introduced in Section 3.2 is
already a hypothesis testing problem, with hypotheses

H0 : G ∼ G (n, P0 ) and H1 : G ∼ ISBMD (n, k, P11 , P12 , P22 )

where H0 is a composite hypothesis and P0 can vary over all edge densities in (0, 1). As we will discuss at
the end of this section, computational lower bounds for this hypothesis testing problem imply lower bounds
for the problem of recovering the hidden community C1 .

Testing Hidden Partition Models. Let C = (C1 , C2 , . . . , Cr ) and D = (D1 , D2 , . . . , Dr ) be two fixed
sequences, each consisting of disjoint K-subsets of [n]. Let GHPMD (n, r, C, D, γ) denote the distribution

over random matrices M ∈ Rn×n introduced in Section 3.3 conditioned on the fixed sequences C and D.
We denote the mixture over these distributions induced by choosing C and D independently and uniformly at
random from all admissible such sequences as GHPMD (n, r, K, γ). Similarly, we let BHPMD (n, r, C, P0 , γ)
denote the distribution over bipartite graphs G with two parts of size n, each indexed by [n] with edges
included independently with probability

 P0 + γ if i ∈ Ch and j ∈ Dh for some h ∈ [r]
γ
P [(i, j) ∈ E(G)] = P0 − r−1 if i ∈ Ch1 and j ∈ Dh2 where h1 6= h2

P0 otherwise

where P0 , γ ∈ (0, 1) be such that γ/r ≤ P0 ≤ 1 − γ. Then let BHPMD (n, r, K, P0 , γ) denote the
mixture formed by choosing C and D randomly as in GHPMD . The problems GHPM(n, r, C, D, γ) and
BHPM (n, r, K, P0 , γ) are simple hypothesis testing problems given by

H0 : M ∼ N (0, 1)⊗n×n and H1 : M ∼ GHPMD (n, r, K, γ)


H0 : G ∼ GB (n, n, P0 ) and H1 : G ∼ BHPMD (n, r, K, P0 , γ)

where GB (n, n, P0 ) denotes the Erdős-Rényi distribution over bipartite graphs with two parts each indexed
by [n] and where each edge is included independently with probability P0 .

Semirandom Planted Dense Subgraph. Our hypothesis testing formulation for SEMI - CR(n, k, P1 , P0 )
has observation G ∈ Gn and two composite hypotheses given by

$$H_0: G \sim \mathbb{P}_0 \ \text{for some } \mathbb{P}_0 \in \mathrm{ADV}(G(n, P_0))$$

$$H_1: G \sim \mathbb{P}_1 \ \text{for some } \mathbb{P}_1 \in \mathrm{ADV}(G(n, k, P_1, P_0))$$

Here, ADV (G(n, k, P1 , P0 )) denotes the set of distributions induced by a semirandom adversary that can
only remove edges outside of the planted dense subgraph S. Similarly, the set ADV (G(n, P0 )) corresponds
to an adversary that can remove any edges from the Erdős-Rényi graph G(n, P0). We will discuss at the end
of this section how computational lower bounds for this hypothesis testing formulation imply lower bounds
for the problem of approximately recovering the vertex subset corresponding to the planted dense subgraph.

Negative Sparse PCA. Our hypothesis testing formulation for NEG - SPCA(n, k, d, θ) is the spiked covari-
ance model introduced in [JL04] and used to formulate ordinary SPCA in [GMZ17, BBH18, BB19b]. This
problem has hypotheses given by

$$H_0: (X_1, X_2, \ldots, X_n) \sim_{\text{i.i.d.}} N(0, I_d)$$

$$H_1: (X_1, X_2, \ldots, X_n) \sim_{\text{i.i.d.}} N\left(0, I_d - \theta vv^\top\right)$$

where $v \in \mathbb{R}^d$ is a k-sparse unit vector with entries in $\{0, 1/\sqrt{k}\}$ chosen uniformly at random.

Unsigned and Mixtures of SLRs. Given a vector v ∈ Rd , let LRd (v) be the distribution of a single
sample-label pair (X, y) ∈ Rd × R given by

y = hv, Xi + η where X ∼ N (0, Id ) and η ∼ N (0, 1) are independent

Given a subset S ⊆ [d], let MSLRD(n, S, d, τ, 1/2) denote the distribution over n independent sample-label
pairs $(X_1, y_1), (X_2, y_2), \ldots, (X_n, y_n)$, each distributed as
$$(X_i, y_i) \sim \mathrm{LR}_d(\tau s_i v_S) \quad\text{where } s_i \sim_{\text{i.i.d.}} \mathrm{Rad}$$
where $v_S = |S|^{-1/2} \cdot \mathbf{1}_S$ and Rad denotes the Rademacher distribution, which is uniform over {−1, 1}.
Note that this is an even mixture of sparse linear regressions with hidden unit vectors $v_S$ and $-v_S$ and signal
strength τ. Let MSLRD(n, k, d, τ, 1/2) denote the mixture of these distributions induced by choosing S
uniformly at random from all k-subsets of [d]. Our hypothesis testing formulation for MSLR(n, k, d, τ) has
two simple hypotheses given by
$$H_0: \{(X_i, y_i)\}_{i \in [n]} \sim \left( N(0, I_d) \otimes N(0, 1 + \tau^2) \right)^{\otimes n}$$
$$H_1: \{(X_i, y_i)\}_{i \in [n]} \sim \mathrm{MSLR}_D(n, k, d, \tau, 1/2)$$

Our hypothesis testing formulation of USLR(n, k, d, τ ) is a simple derivative of this formulation obtained
by replacing each observation (Xi , yi ) with (Xi , |yi |). We remark that, unlike RSME where an estimation
algorithm trivially solved the hypothesis testing formulation, the hypothesis H0 here is not an instance of
MSLR corresponding to a hidden vector of zero. This is because the labels yi under H0 have variance 1 + τ 2 ,
whereas they would have variance 1 if they were this instance of MSLR. However, this detection problem
still yields hardness for the estimation variants of MSLR and USLR described in Section 3.6, albeit with a
slightly more involved argument. This is discussed in Section 17.

Robust SLR. Our hypothesis testing formulation for RSLR(n, k, d, τ, ε) has hypotheses given by
$$H_0: \{(X_i, y_i)\}_{i \in [n]} \sim \left( N(0, I_d) \otimes N(0, 1 + \tau^2) \right)^{\otimes n}$$
$$H_1: \{(X_i, y_i)\}_{i \in [n]} \sim_{\text{i.i.d.}} \mathrm{MIX}_\epsilon\left( \mathrm{LR}_d(\tau v), D_O \right)$$

where $D_O$ is any adversarially chosen outlier distribution on $\mathbb{R}^d \times \mathbb{R}$, where $v \in \mathbb{R}^d$ is a random k-sparse
unit vector chosen uniformly at random from all such vectors with entries in $\{0, 1/\sqrt{k}\}$. As with the
other formulations of SLR, we defer discussing the implications of lower bounds in this formulation for the
estimation task described in Section 3.7 to Section 17.

Tensor PCA. Let $\mathrm{TPCA}^s_D(n, \theta)$ denote the distribution on order-s tensors $T \in \mathbb{R}^{n^{\otimes s}}$ with dimensions all
equal to n, given by $T = \theta \cdot v^{\otimes s} + G$ where $G \sim N(0, 1)^{\otimes n^{\otimes s}}$ and $v \in \{-1, 1\}^n$ is chosen independently
and uniformly at random. As already introduced in Section 3.8, our hypothesis testing formulation for
$\mathrm{TPCA}^s(n, \theta)$ is given by
$$H_0: T \sim N(0, 1)^{\otimes n^{\otimes s}} \quad\text{and}\quad H_1: T \sim \mathrm{TPCA}^s_D(n, \theta)$$

Unlike the other problems we consider, our reductions only show computational lower bounds for black-
boxes solving this hypothesis testing problem with a low false positive probability. As we will show in
Section 15, this implies a lower bound for the canonical estimation formulation for tensor PCA.
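For concreteness, a minimal sampler for $\mathrm{TPCA}^s_D(n, \theta)$ under H1, assuming numpy; the function name is ours:

```python
import numpy as np
from functools import reduce

def sample_tpca_h1(n, s, theta, rng):
    # Rank-one spike theta * v^{tensor s} with v uniform on {-1, 1}^n,
    # plus an independent N(0, 1) entry for each of the n^s positions.
    v = rng.choice([-1.0, 1.0], size=n)
    spike = reduce(np.multiply.outer, [v] * s)
    noise = rng.standard_normal((n,) * s)
    return theta * spike + noise
```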

Generalized Learning Sparse Mixtures. Let {Pµ }µ∈R and Q be distributions on an arbitrary measurable
space (X , B) and let D be a mixture distribution on R. Let GLSMD (n, S, d, {Pµ }µ∈R , Q, D) denote the
distribution over X1 , X2 , . . . , Xn ∈ X d introduced in Section 3.9 and let GLSMD (n, k, d, {Pµ }µ∈R , Q, D)
denote the mixture over these distributions induced by sampling S uniformly at random from the family
of k-subsets of [d]. Our general sparse mixtures detection problem GLSM(n, S, d, {Pµ }µ∈R , Q, D) is the
following simple vs. simple hypothesis testing formulation

H0 : (X1 , X2 , . . . , Xn ) ∼i.i.d. Q⊗d and H1 : (X1 , X2 , . . . , Xn ) ∼ GLSMD (n, k, d, {Pµ }µ∈R , Q, D)

Lower bounds for this formulation directly imply lower bounds for algorithms that return an estimate Ŝ of
S given samples from GLSMD (n, S, d, {Pµ }µ∈R , Q, D) with |Ŝ∆S| < k/2 with probability 1/2 + Ω(1) for

all |S| ≤ k. Note that under H0 , such an algorithm would output some set Ŝ of size less than k/2 and, under
H1 , it would output a set of size greater than k/2, each with probability 1/2 + Ω(1). Thus thresholding |Ŝ|
at k/2 solves this detection formulation in the sense of Section 3.

6.4 Notation
In this section, we establish notation that will be used repeatedly throughout this paper. Some of these
definitions are repeated later upon use for convenience. Let L(X) denote the distribution law of a random
variable X and given two laws L1 and L2 , let L1 + L2 denote L(X + Y ) where X ∼ L1 and Y ∼ L2 are
independent. Given a distribution P, let P ⊗n denote the distribution of (X1 , X2 , . . . , Xn ) where the Xi are
i.i.d. according to P. Similarly, let P ⊗m×n denote the distribution on Rm×n with i.i.d. entries distributed
as P. We let $\mathbb{R}^{n^{\otimes s}}$ denote the set of all order-s tensors with dimensions all n in size that contain $n^s$ entries.
The distribution $P^{\otimes n^{\otimes s}}$ denotes a tensor of these dimensions with entries independently sampled from P.
We say that two parameters a and b are polynomial in one another if there is a constant C > 0 such that
$a^{1/C} \le b \le a^C$ as $a \to \infty$. In this paper, we adopt the standard asymptotic notation O(·), Ω(·), o(·), ω(·)
and Θ(·). We let $a \asymp b$, $a \lesssim b$ and $a \gtrsim b$ be shorthands for a = Θ(b), a = O(b) and a = Ω(b), respectively.
In all problems that we consider, our main focus is on the polynomial order of growth at computational
barriers, usually in terms of a natural parameter n. Given a natural parameter n that will usually be clear
from context, we let a = Õ(b) be a shorthand for a = O (b · (log n)c ) for some constant c > 0, and define
Ω̃(·), õ(·), ω̃(·) and Θ̃(·) analogously. Oftentimes, it will be true that b is polynomial in n, in which case n
can be replaced by b in the definition above.
Given a finite or measurable set X , let Unif[X ] denote the uniform distribution on X . Let Rad be
shorthand for Unif[{−1, 1}], corresponding to the special case of a Rademacher random variable. Let dTV ,
dKL and χ2 denote total variation distance, KL divergence and χ2 divergence, respectively. Let N (µ, Σ)
denote a multivariate normal random vector with mean µ ∈ Rd and covariance matrix Σ, where Σ is a
d × d positive semidefinite matrix, and let Bern(p) denote the Bernoulli distribution with probability p.
Let [n] = {1, 2, . . . , n} and Gn be the set of simple graphs on n vertices. Let G(n, p) denote the Erdős-
Rényi distribution over n-vertex graphs where each edge is included independently with probability p. Let
GB (m, n, p) denote the Erdős-Rényi distribution over (m + n)-vertex bipartite graphs with m left vertices,
n right vertices and such that each of the mn possible edges is included independently with probability p.
Throughout this paper, we will refer to bipartite graphs with m left vertices and n right vertices and matrices
in {0, 1}m×n interchangeably. Let 1S denote the vector v ∈ Rn with vi = 1 if i ∈ S and vi = 0 if i 6∈ S
where S ⊆ [n]. Let $\mathrm{MIX}_\epsilon(D_1, D_2)$ denote the ε-mixture distribution formed by sampling $D_1$ with probability
$(1-\epsilon)$ and $D_2$ with probability ε. Given a partition E of [N] with k parts, let $\mathcal{U}_N(E)$ denote the uniform
distribution over all k-subsets of [N ] containing exactly one element from each part of E.
Given a matrix M ∈ Rn×n , the matrix MS,T ∈ Rk×k where S, T are k-subsets of [n] refers to the minor
of M restricted to the row indices in S and column indices in T . Furthermore, (MS,T )i,j = MσS (i),σT (j)
where σS : [k] → S is the unique order-preserving bijection and σT is analogously defined. Given an index
set I, subset S ⊆ I and pair of distributions (P, Q), let MI (S, P, Q) denote the distribution of a collection
of independent random variables (Xi : i ∈ I) with Xi ∼ P if i ∈ S and Xi ∼ Q if i 6∈ S. When S
is a random set, this MI (S, P, Q) denotes a mixture over the randomness of S e.g. M[N ] (UN (E), P, Q)
denotes a mixture of M[N ] (S, P, Q) over S ∼ UN (E). Generally, given an index set I and |I| distributions
P1 , P2 , . . . , P|I| , let MI (Pi : i ∈ I) denote the distribution of independent random variables (Xi : i ∈ I)
with Xi ∼ Pi for each i ∈ I. The planted Bernoulli distribution PB(n, i, p, q) is over V ∈ {0, 1}n with
independent entries satisfying that Vj ∼ Bern(q) unless j = i, in which case Vi ∼ Bern(p). In other
words, PB(n, i, p, q) is a shorthand for $\mathcal{M}_{[n]}(\{i\}, \mathrm{Bern}(p), \mathrm{Bern}(q))$. Similarly, the planted dense subgraph
distribution G(n, S, p, q) can be written as $\mathcal{M}_I\left( \binom{S}{2}, \mathrm{Bern}(p), \mathrm{Bern}(q) \right)$ where $I = \binom{[n]}{2}$.

7 Rejection Kernels and Reduction Preprocessing
In this section, we present several average-case reduction primitives that will serve as the key subroutines
and preprocessing steps in our reductions. These include pre-existing subroutines from the rejection kernels
framework introduced in [BBH18, BBH19, BB19b], such as univariate rejection kernels from binary inputs
and G AUSSIANIZE. We introduce the primitive T O -k-PARTITE -S UBMATRIX, which is a generalization of
T O -S UBMATRIX from [BBH19] that maps from the k-partite variant of planted dense subgraph to Bernoulli
matrices, by filling in the missing diagonal and symmetrizing. We also introduce a new variant of rejection
kernels called symmetric 3-ary rejection kernels that will be crucial in our reductions showing universality
of lower bounds for sparse mixtures.

7.1 Gaussian Rejection Kernels


Rejection kernels are a framework in [BBH18, BBH19, BB19b] for algorithmic changes of measure based
on rejection sampling. Related reduction primitives for changes of measure to Gaussians and binomial
random variables appeared earlier in [MW15b, HWX15]. Rejection kernels mapping a pair of Bernoulli
distributions to a target pair of scalar distributions were introduced in [BBH18]. These were extended
to arbitrary high-dimensional target distributions and applied to obtain universality results for submatrix
detection in [BBH19]. A surprising and key feature of both of these rejection kernels is that they are not
lossy in mapping one computational barrier to another. For instance, in [BBH19], multivariate rejection
kernels were applied to increase the relative size k of the planted submatrix, faithfully mapping instances
tight to the computational barrier at lower k to tight instances at higher k. This feature is also true of the
scalar rejection kernels applied in [BBH18].
In this work, we will only need a subset of prior results on rejection kernels. In this section, we give
an overview of the key guarantees for Gaussian rejection kernels with binary inputs from [BBH18] and for
G AUSSIANIZE from [BB19b]. We will also need a new ternary input variant of rejection kernels that will be
introduced in Section 7.3. We begin by introducing the Gaussian rejection kernel RKG (µ, B) which maps
B ∈ {0, 1} to a real valued output and is parameterized by some 0 < q < p ≤ 1. The map RKG (µ, B)
transforms two Bernoulli inputs approximately into Gaussians. Specifically, it satisfies the two Markov
transition properties

RK G (µ, B) ≈ N (0, 1) if B ∼ Bern(q) and RK G (µ, B) ≈ N (µ, 1) if B ∼ Bern(p)


where RK_G(µ, B) can be computed in poly(n) time, the ≈ above are up to $O_n(n^{-3})$ total variation distance,
and $\mu = \Theta(1/\sqrt{\log n})$. The maps RK_G(µ, B) can be implemented with the rejection sampling scheme
shown in Figure 3. The total variation guarantees for Gaussian rejection kernels are captured formally in the
following theorem.

Lemma 7.1 (Gaussian Rejection Kernels – Lemma 5.4 in [BBH18]). Let $R_{RK}$ be a parameter and suppose
that $p = p(R_{RK})$ and $q = q(R_{RK})$ satisfy that $0 < q < p \le 1$, $\min(q, 1-q) = \Omega(1)$ and $p - q \ge R_{RK}^{-O(1)}$.
Let $\delta = \min\left\{ \log\frac{p}{q}, \log\frac{1-q}{1-p} \right\}$. Suppose that $\mu = \mu(R_{RK}) \in (0, 1)$ satisfies that
$$\mu \le \frac{\delta}{2\sqrt{6 \log R_{RK} + 2 \log(p-q)^{-1}}}$$
Then the map $\mathrm{RK}_G$ with $N = 6\delta^{-1} \log R_{RK}$ iterations can be computed in $\mathrm{poly}(R_{RK})$ time and satisfies
$$d_{TV}\left( \mathrm{RK}_G(\mu, \mathrm{Bern}(p)), N(\mu, 1) \right) = O\left( R_{RK}^{-3} \right) \quad\text{and}\quad d_{TV}\left( \mathrm{RK}_G(\mu, \mathrm{Bern}(q)), N(0, 1) \right) = O\left( R_{RK}^{-3} \right)$$

Algorithm RKG (µ, B)
Parameters: Input B ∈ {0, 1}, Bernoulli probabilities 0 < q < p ≤ 1, Gaussian mean µ, number of
iterations N, let $\varphi_\mu(x) = \frac{1}{\sqrt{2\pi}} \cdot \exp\left( -\frac{1}{2}(x - \mu)^2 \right)$ denote the density of N(µ, 1)


1. Initialize z ← 0.

2. Until z is set or N iterations have elapsed:

(1) Sample $z' \sim N(0, 1)$ independently.

(2) If B = 0, if the condition $p \cdot \varphi_0(z') \ge q \cdot \varphi_\mu(z')$ holds, then set $z \leftarrow z'$ with probability $1 - \frac{q \cdot \varphi_\mu(z')}{p \cdot \varphi_0(z')}$.

(3) If B = 1, if the condition $(1-q) \cdot \varphi_\mu(z' + \mu) \ge (1-p) \cdot \varphi_0(z' + \mu)$ holds, then set $z \leftarrow z' + \mu$ with probability $1 - \frac{(1-p) \cdot \varphi_0(z' + \mu)}{(1-q) \cdot \varphi_\mu(z' + \mu)}$.

3. Output z.

Algorithm G AUSSIANIZE
Parameters: Collection of variables $X_i \in \{0, 1\}$ for $i \in I$ where I is some index set with |I| = n,
rejection kernel parameter $R_{RK}$, Bernoulli probabilities $0 < q < p \le 1$ with $p - q = R_{RK}^{-O(1)}$ and
$\min(q, 1-q) = \Omega(1)$, and target means $0 \le \mu_i \le \tau$ for each $i \in I$ where τ > 0 is a parameter

1. Form the collection of variables Y ∈ RI by setting

Yi ← RKG (µi , Xi )

for each $i \in I$, where each $\mathrm{RK}_G$ is run with parameter $R_{RK}$ and $N_{it} = \lceil 6\delta^{-1} \log R_{RK} \rceil$ iterations,
where $\delta = \min\left\{ \log\frac{p}{q}, \log\frac{1-q}{1-p} \right\}$.

2. Output the collection of variables (Yi : i ∈ I).

Figure 3: Gaussian instantiation of the rejection kernel algorithm from [BBH18] and the reduction G AUSSIANIZE for
mapping from Bernoulli to Gaussian planted problems from [BB19b].

The proof of this lemma consists of showing that the distributions of the outputs $\mathrm{RK}_G(\mu, \mathrm{Bern}(p))$ and
$\mathrm{RK}_G(\mu, \mathrm{Bern}(q))$ are close to N(µ, 1) and N(0, 1) when conditioned to lie in the set of x with $\frac{1-p}{1-q} \le
\frac{\varphi_\mu(x)}{\varphi_0(x)} \le \frac{p}{q}$, and then showing that this event occurs with probability close to one. The original framework
in [BBH18] mapped binary inputs to more general pairs of target distributions than N(µ, 1) and N(0, 1);
however, we will only require binary-input rejection kernels in the Gaussian case. A multivariate extension of
this framework appeared in [BBH19].
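For concreteness, the following is a direct transcription of RK_G from Figure 3 into Python (assuming numpy; a sketch that does not handle the choice of N from Lemma 7.1 or numerical edge cases):

```python
import numpy as np

def phi(x, mu):
    # Density of N(mu, 1) at x.
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def rk_gaussian(B, mu, p, q, N, rng):
    """Maps B ~ Bern(q) approximately to N(0, 1) and B ~ Bern(p) to N(mu, 1)."""
    z = 0.0
    for _ in range(N):
        zp = rng.standard_normal()
        if B == 0:
            if p * phi(zp, 0.0) >= q * phi(zp, mu):
                if rng.random() < 1.0 - (q * phi(zp, mu)) / (p * phi(zp, 0.0)):
                    return zp
        else:
            num = (1.0 - p) * phi(zp + mu, 0.0)
            den = (1.0 - q) * phi(zp + mu, mu)
            if den >= num:
                if rng.random() < 1.0 - num / den:
                    return zp + mu
    return z
```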
Given an index set I, subset S ⊆ I and pair of distributions (P, Q), let MI (S, P, Q) denote the

distribution of a collection of independent random variables (Xi : i ∈ I) with Xi ∼ P if i ∈ S and Xi ∼ Q
if i 6∈ S. More generally, given an index set I and |I| distributions P1 , P2 , . . . , P|I| , let MI (Pi : i ∈ I)
denote the distribution of independent random variables $(X_i : i \in I)$ with $X_i \sim P_i$ for each $i \in I$. For
example, a planted clique in G(n, 1/2) on the set S ⊆ [n] can be written as $\mathcal{M}_I\left( \binom{S}{2}, \mathrm{Bern}(1), \mathrm{Bern}(1/2) \right)$
where $I = \binom{[n]}{2}$.
We now review the guarantees for the subroutine G AUSSIANIZE. The variant presented here is restated
from [BB19b] to be over a general index set I rather than matrices, and with the rejection kernel parameter
RRK decoupled from the size n of I, as shown in Figure 3. G AUSSIANIZE maps a set of planted Bernoulli
random variables to a set of independent Gaussian random variables with corresponding planted means. The
procedure applies a Gaussian rejection kernel entrywise and its total variation guarantees follow by a simple
application of the tensorization property of dTV from Fact 6.2.
Lemma 7.2 (Gaussianization – Lemma 4.5 in [BB19b]). Let I be an index set with |I| = n and let $R_{RK}$,
$0 < q < p \le 1$ and δ be as in Lemma 7.1. Let $\mu_i$ be such that $0 \le \mu_i \le \tau$ for each $i \in I$ where the
parameter τ > 0 satisfies that
$$\tau \le \frac{\delta}{2\sqrt{6 \log R_{RK} + 2 \log(P - Q)^{-1}}}$$
The algorithm A = GAUSSIANIZE runs in $\mathrm{poly}(n, R_{RK})$ time and satisfies that
$$d_{TV}\left( A(\mathcal{M}_I(S, \mathrm{Bern}(P), \mathrm{Bern}(Q))), \ \mathcal{M}_I(N(\mu_i \cdot \mathbf{1}(i \in S), 1) : i \in I) \right) = O\left( n \cdot R_{RK}^{-3} \right)$$
for all subsets S ⊆ I.

7.2 Cloning and Planting Diagonals


We begin by reviewing the subroutine G RAPH -C LONE, shown in Figure 4, which was introduced in [BBH19]
and produces several independent samples from a planted subgraph problem given a single sample. Its
properties as a Markov kernel are stated in the next lemma, which is proven by showing the two explicit
expressions for P[xij = v] in Step 1 define valid probability distributions and then explicitly writing the
mass functions of A (G(n, q)) and A (G(n, S, p, q)).
Lemma 7.3 (Graph Cloning – Lemma 5.2 in [BBH19]). Let $t \in \mathbb{N}$, $0 < q < p \le 1$ and $0 < Q < P \le 1$
satisfy that
$$\frac{1-p}{1-q} \le \left( \frac{1-P}{1-Q} \right)^t \quad\text{and}\quad \left( \frac{P}{Q} \right)^t \le \frac{p}{q}$$
Then the algorithm A = GRAPH-CLONE runs in poly(t, n) time and satisfies that for each S ⊆ [n],
$$A(G(n, q)) \sim G(n, Q)^{\otimes t} \quad\text{and}\quad A(G(n, S, p, q)) \sim G(n, S, P, Q)^{\otimes t}$$

Graph cloning more generally produces a method to clone a set of Bernoulli random variables indexed
by a general index set I instead of the possible edges of a graph on the vertex set [n]. The guarantees for this
subroutine are stated in the following lemma. We remark that both of these lemmas will always be applied
with t = O(1), resulting in a constant loss in signal strength.
Lemma 7.4 (Bernoulli Cloning). Let I be an index set with |I| = n, let t ∈ N, 0 < q < p ≤ 1 and
0 < Q < P ≤ 1 satisfy that

1−P t
   t
1−p P p
≤ and ≤
1−q 1−Q Q q

45
Algorithm G RAPH -C LONE
Inputs: Graph G ∈ Gn , the number of copies t, parameters 0 < q < p ≤ 1 and 0 < Q < P ≤ 1
 t  t
satisfying 1−p
1−q ≤ 1−P
1−Q and P
Q ≤ pq

1. Generate xij ∈ {0, 1}t for each 1 ≤ i < j ≤ n such that:

• If {i, j} ∈ E(G), sample xij from the distribution on {0, 1}t with
1 h i
P[xij = v] = (1 − q) · P |v|1 (1 − P )t−|v|1 − (1 − p) · Q|v|1 (1 − Q)t−|v|1
p−q

• If {i, j} 6∈ E(G), sample xij from the distribution on {0, 1}t with
1 h i
P[xij = v] = p · Q|v|1 (1 − Q)t−|v|1 − q · P |v|1 (1 − P )t−|v|1
p−q

2. Output the graphs (G1 , G2 , . . . , Gt ) where {i, j} ∈ E(Gk ) if and only if xij
k = 1.

Figure 4: Subroutine G RAPH -C LONE for producing independent samples from planted graph problems from
[BBH19].

There is an algorithm A = B ERNOULLI -C LONE that runs in poly(t, n) time and satisfying

A (MI (Bern(q))) ∼ MI (Bern(Q))⊗t and


A (MI (S, Bern(p), Bern(q))) ∼ MI (S, Bern(P ), Bern(Q))⊗t

for each S ⊆ I.
We now introduce the procedure T O -k-PARTITE -S UBMATRIX, which is shown in Figure 5 and will be
crucial in our reductions to dense variants of the stochastic block model. This reduction clones the upper
half of the adjacency matrix of the input graph problem to produce an independent lower half and plants
diagonal entries while randomly embedding into a larger matrix to hide the diagonal entries in total variation.
T O -k-PARTITE -S UBMATRIX is similar to T O -S UBMATRIX in [BBH19] and T O -B ERNOULLI -S UBMATRIX
in [BB19b] but ensures that the random embedding step accounts for the k-partite promise of the input k-
PDS instance. Completing the missing diagonal entries in the adjacency matrix will be crucial to apply one
of our main techniques, Bernoulli rotations, which will be introduced in the next section.
The next lemma states the total variation guarantees of T O -k-PARTITE -S UBMATRIX and is a k-partite
variant of Theorem 6.1 in [BBH19]. Although technically more subtle than the analysis of T O -S UBMATRIX
in [BBH19], this proof is tangential to our main reduction techniques and deferred to Appendix A.2. Given
a partition E of [N ] with k parts, let UN (E) denote the uniform distribution over all k-subsets of [N ]
containing exactly one element from each part of E.
Lemma 7.5 (Reduction to k-Partite Bernoulli Submatrix Problems). Let 0 < q < p ≤ 1 and Q = 1 −
p √
(1 − p)(1 − q) + 1{p=1} q − 1 . Suppose that n and N are such that
 
p
n≥ + 1 N and k ≤ QN/4
Q

46
Algorithm T O -k-PARTITE -S UBMATRIX
Inputs: k- PDS instance G ∈ GN with clique size k that divides N and partition  E of[N ], edge
−O(1) p
probabilities 0 < q < p ≤ 1 with q = N and target dimension n ≥ Q + 1 N where
p √ 
Q = 1 − (1 − p)(1 − q) + 1{p=1} q − 1 and k divides n
p
1. Apply G RAPH -C LONE to G with edge probabilities P = p and Q = 1 − (1 − p)(1 − q) +
√ 
1{p=1} q − 1 and t = 2 clones to obtain (G1 , G2 ).

2. Let F be a partition of [n] with [n] = F1 ∪ F2 ∪ · · · ∪ Fk and |Fi | = n/k. Form the matrix
MPD ∈ {0, 1}n×n as follows:

(1) For each t ∈ [k], sample st1 ∼ Bin(N/k, p) and st2 ∼ Bin(n/k, Q) and let St be a subset of
Ft with |St | = N/k selected uniformly at random. Sample T1t ⊆ St and T2t ⊆ Ft \St with
|T1t | = st1 and |T2t | = max{st2 − st1 , 0} uniformly at random.
(2) Now form the matrix MPD such that its (i, j)th entry is


 1{πt (i),πt (j)}∈E(G1 ) if i < j and i, j ∈ St
 1{πt (i),πt (j)}∈E(G2 ) if i > j and i, j ∈ St



(MPD )ij = 1{i∈T1t } if i = j and i, j ∈ St
if i = j and i, j ∈ Ft \St

 1{i∈T2t }



∼i.i.d. Bern(Q) if i 6= j and (i, j) 6∈ St2 for a t ∈ [k]

where πt : St → Et is a bijection chosen uniformly at random.

3. Output the matrix MPD and the partition F .

Figure 5: Subroutine T O -k-PARTITE -S UBMATRIX for mapping from an instance of k-partite planted dense subgraph
to a k-partite Bernoulli submatrix problem.

Also suppose that q = N −O(1) and both N and n are divisible by k. Let E = (E1 , E2 , . . . , Ek ) and
F = (F1 , F2 , . . . , Fk ) be partitions of [N ] and [n], respectively. Then it follows that the algorithm A =
T O -k-PARTITE -S UBMATRIX runs in poly(N ) time and satisfies
 r
Q2 N 2 CQ k 2


dTV A(G(N, UN (E), p, q)), M[n]×[n] (Un (F ), Bern(p), Bern(Q)) ≤ 4k · exp − +
48pkn 2n
2 2
 
Q N
dTV A(G(N, q)), Bern (Q)⊗n×n ≤ 4k · exp −

48pkn
n o
Q
where CQ = max 1−Q , 1−Q
Q .

For completeness, we give an intuitive summary of the technical subtleties arising in the proof of this
lemma. After applying G RAPH -C LONE, the adjacency matrix of the input graph G is still missing its diag-
onal entries. The main difficulty in producing these diagonal entries is to ensure that entries corresponding
to vertices in the planted subgraph are properly sampled from Bern(p). To do this, we randomly embed
the original N × N adjacency matrix in a larger n × n matrix with i.i.d. entries from Bern(Q) and sample

47
Algorithm 3- SRK(B, P+ , P− , Q)
Parameters: Input B ∈ {−1, 0, 1}, number of iterations N , parameters a ∈ (0, 1) and sufficiently small
nonzero µ1 , µ2 ∈ R, distributions P+ , P− and Q over a measurable space (X, B) such that (P+ , Q) and
(P− , Q) are computable pairs

1. Initialize z arbitrarily in the support of Q.

2. Until z is set or N iterations have elapsed:

(1) Sample z 0 ∼ Q independently and compute the two quantities

dP+ 0 dP− 0 dP+ 0 dP− 0


L1 (z 0 ) = (z ) − (z ) and L2 (z 0 ) = (z ) + (z ) − 2
dQ dQ dQ dQ

(2) Proceed to the next iteration if it does not hold that

2|µ2 |
2|µ1 | ≥ L1 (z 0 ) ≥ |L2 (z 0 )|

and
max{a, 1 − a}

(3) Set z ← z 0 with probability PA (x, B) where



 1 + 4µa 2 · L2 (z 0 ) + 1
4µ1 · L1 (z 0 ) if B = 1
1  0
PA (x, B) = · 1 − 1−a
4µ2 · L2 (z ) if B = 0
2  1 + 1 · L (z 0 ) − a 0
4µ2 2 4µ1 · L1 (z ) if B = −1

3. Output z.

Figure 6: 3-ary symmetric rejection kernel algorithm.

all diagonal entries corresponding to entries of the original matrix from Bern(p). The diagonal entries in
the new n − N columns are chosen so that the supports on the diagonals within each Ft each have size
Bin(n/k, Q). Even though this causes the sizes of the supports on the diagonals in each Ft to have the same

distribution under both H0 and H1 , the randomness of the embedding and the fact that k = o( n) ensures
that this is hidden in total variation.

7.3 Symmetric 3-ary Rejection Kernels


In this section, we introduce symmetric 3-ary rejection kernels, which will be the key gadget in our reduc-
tion showing universality of lower bounds for learning sparse mixtures in Section 16. In order to map to
universal formulations of sparse mixtures, it is crucial to produce a nontrivial instance of a sparse mixture
with multiple planted distributions. Since previous rejection kernels all begin with binary inputs, they do
not have enough degrees of freedom to map to three output distributions. The symmetric 3-ary rejection
kernels 3- SRK introduced in this section overcome this issue by mapping from distributions supported on
{−1, 0, 1} to three output distributions P+ , P− and Q. In order to produce clean total variation guarantees,
these rejection kernels also exploit symmetry in their three input distributions on {−1, 0, 1}.
Let Tern(a, µ1 , µ2 ) where a ∈ (0, 1) and µ1 , µ2 ∈ R denote the probability distribution on {−1, 0, 1}

48
such that if B ∼ Tern(a, µ1 , µ2 ) then
1−a 1−a
P[X = −1] = − µ1 + µ2 , P[X = 0] = a − 2µ2 , P[X = 1] = + µ1 + µ2
2 2
if all three of these probabilities are nonnegative. The map 3- SRK(B), shown in Figure 6, sends an input
B ∈ {−1, 0, 1} to a set X simultaneously satisfying three Markov transition properties:
1. if B ∼ Tern(a, µ1 , µ2 ), then 3- SRK(B) is close to P+ in total variation;
2. if B ∼ Tern(a, −µ1 , µ2 ), then 3- SRK(B) is close to Q in total variation; and
3. if B ∼ Tern(a, 0, 0), then 3- SRK(B) is close to P− in total variation.
In order to state our main results for 3- SRK(B), we will need the notion of computable pairs from [BBH19].
The definition below is that given in [BBH19], without the assumption of finiteness of KL divergences. This
assumption was convenient for the Chernoff exponent analysis needed for multivariate rejection kernels in
[BBH19]. Since our rejection kernels are univariate, we will be able to state our universality conditions
directly in terms of tail bounds rather than Chernoff exponents.
Definition 7.6 (Relaxed Computable Pair [BBH19]). Define a pair of sequences of distributions (P, Q)
over a measurable space (X, B) where P = (Pn ) and Q = (Qn ) to be computable if:
1. there is an oracle producing a sample from Qn in poly(n) time;
2. for all n, Pn and Qn are mutually absolutely continuous and the likelihood ratio satisfies
  " −1 #
dPn dPn
Ex∼Qn (x) = Ex∼Pn (x) =1
dQn dQn
dPn
where dQn is the Radon-Nikodym derivative; and
dPn
3. there is an oracle computing dQn (x) in poly(n) time for each x ∈ X.
We remark that the second condition above always holds for discrete distributions and generally for
most well-behaved distributions P and Q. We now state our main total variation guarantees for 3- SRK. The
proof of the next lemma follows a similar structure to the analysis of rejection sampling as in Lemma 5.1
of [BBH18] and Lemma 5.1 of [BBH19]. However, the bounds that we obtain are different than those in
[BBH18, BBH19] because of the symmetry of the three input Tern distributions. The proof of this lemma is
deferred to Appendix A.3.
Lemma 7.7 (Symmetric 3-ary Rejection Kernels). Let a ∈ (0, 1) and µ1 , µ2 ∈ R be nonzero and such that
Tern(a, µ1 , µ2 ) is well-defined. Let P+ , P− and Q be distributions over a measurable space (X, B) such
that (P+ , Q) and (P− , Q) are computable pairs with respect to a parameter n. Let S ⊆ X be the set
 
dP+ dP− 2|µ2 | dP+ dP−
S = x ∈ X : 2|µ1 | ≥ (x) − (x) and
≥ (x) + (x) − 2
dQ dQ max{a, 1 − a} dQ dQ
Given a positive integer N , then the algorithm 3- SRK : {−1, 0, 1} → X can be computed in poly(n, N )
time and satisfies that

dTV (3- SRK(Tern(a, µ1 , µ2 )), P+ )  
 N

−1 −1
 1 −1 −1
dTV (3- SRK(Tern(a, −µ1 , µ2 )), P− ) ≤ 2δ 1 + |µ1 | + |µ2 | + + δ 1 + |µ1 | + |µ2 |
2
dTV (3- SRK(Tern(a, 0, 0)), Q)

where δ > 0 is such that PX∼P+ [X 6∈ S], PX∼P− [X 6∈ S] and PX∼Q [X 6∈ S] are upper bounded by δ.

49
8 Dense Bernoulli Rotations
In this section, we formally introduce dense Bernoulli rotations and constructions for their design matrices
and tensors, which will play an essential role in all of our reductions. For an overview of the main high
level ideas underlying these techniques, see Sections 4.2 and 4.3. As mentioned in Sections 4.2, dense
Bernoulli rotations map PB(T, i, p, q) to N µλ−1 · Ai , Im for each i ∈ [T ] and Bern(q)⊗T to N (0, Im )
approximately in total variation, where µ = Θ̃(1), the vectors A1 , A2 , . . . , AT ∈ Rm are for us to design
and λ is an upper bound on the singular values of the matrix with columns Ai .
Simplifying some technical details, our reduction to RSME in Section 10.1 roughly proceeds as follows:
(1) its input is a k- BPC instance with parts of size M and N and biclique dimensions k = kM and kN ; (2)
it applies dense Bernoulli rotations with p = 1 and q = 1/2 to the M kN vectors of length T = N/kN
representing the adjacency patterns in {0, 1}N/kN between each of the M left vertices and each part in the
partition of the right vertices; and (3) it pads the resulting matrix with standard normals so that it has d rows.
Under H1 , the result is a d × kN m matrix 1S u> + N (0, 1)⊗d×kN m where S is the left vertex set of the
biclique and u consists of scaled concatenations of the Ai . We design the adversary so that the target data
matrix D in RSME is roughly of the form
−1/2 , 1
 
 N τk  if i ∈ S and j is not corrupted
Dij ∼ N −1 (1 − )τ k −1/2 , 1 if i ∈ S and j is corrupted
N (0, 1)

otherwise

for each i ∈ [d] and j ∈ [n] where n = kN m. Matching the two distributions above, we arrive at the
following desiderata for the Ai .

• We would like each λ−1 Ai to consist of (1 − 0 )m entries equal to τ k −1/2 and 0 m entries equal to
0−1 (1 − 0 )τ k −1/2 where τ is just below the desired computational barrier τ = Θ̃(k 1/2 1/2 n−1/4 )
and 0 ≤  where 0 = Θ().

• Now observe that the norm of any such λ−1 Ai is Θ τ −1/2 m1/2 k −1/2 which is just below a norm


of Θ̃(m1/2 n−1/4 ) at the computational barrier for RSME. Note that the normalization by λ−1 ensures
that each λ−1 Ai has `2 norm at most 1. To be as close to the computational barrier as possible, it is
necessary that m1/2 n−1/4 = Θ̃(1) which rearranges to m = Θ̃(kN ) since n = kN m.

• When the input is an instance of k- BPC nearly at its computational barrier, we have that N = Θ̃(kN2 )

and thus our necessary condition above implies that m = Θ̃(N/kN ) = Θ̃(T ), and hence that A is
nearly square. Furthermore, if we take the Ai to be unit vectors, our desiderata that the λ−1 Ai have
norm Θ̃(m1/2 n−1/4 ) reduces to λ = Θ̃(1).

Summarizing this discussion, we arrive at exactly the three conditions outlined in Section 4.3. We remark
that while these desiderata are tailored to RSME, they will also turn out to be related to the desired properties
of A in our other reductions. We now formally introduce dense Bernoulli rotations.

8.1 Mapping Planted Bits to Spiked Gaussian Tensors


Let PB(n, i, p, q) and PB(S, i, p, q) denote the planted bit distributions defined in Sections 4.2 and 6.4. The
procedures B ERN -ROTATIONS and its derivative T ENSOR -B ERN -ROTATIONS are shown in Figure 7. Recall
that the subroutine G AUSSIANIZE was introduced in Figure 3. Note that positive semidefinite square roots
of n × n matrices can be computed in poly(n) time. The two key Markov transition properties for these
procedures that will be used throughout the paper are as follows.

50
Algorithm B ERN -ROTATIONS
Inputs: Vector V ∈ {0, 1}n , rejection kernel parameter RRK , Bernoulli probability parameters 0 < q <
p ≤ 1, output dimension m, an m × n matrix A with singular values all at most λ > 0, intermediate
mean parameter µ > 0

1. Form V1 ∈ {0, 1}n by applying G AUSSIANIZE to the entries in the vector V with rejection kernel
parameter RRK , Bernoulli probabilities q and p and target mean parameters all equal to µ.
1/2
2. Sample a vector U ∼ N (0, 1)⊗m and let Im − λ−2 · AA> be the positive semidefinite square
root of Im − λ−2 · AA> . Now form the vector
 1/2
V2 = λ−1 · AV1 + Im − λ−2 · AA> U

3. Output the vector V2 .

Algorithm T ENSOR -B ERN -ROTATIONS


Inputs: Order s tensor T ∈ Ts,n ({0, 1}), rejection kernel parameter RRK , Bernoulli probability parame-
ters 0 < q < p ≤ 1, output dimension m, an m × n matrices A1 , A2 , . . . , As with singular values less
than or equal to λ1 , λ2 , . . . , λs > 0, respectively, mean parameter µ > 0
s
1. Flatten T into the vector V1 ∈ {0, 1}n , form the Kronecker product A = A1 ⊗ A2 ⊗ · · · ⊗ As and
set λ = λ1 λ2 · · · λs .

2. Let V2 be the output of B ERN -ROTATIONS applied to V1 with parameters RRK , 0 < q < p ≤
1, A, λ, µ and output dimension ms .

3. Rearrange the entries of V2 into a tensor T1 ∈ Ts,m (R) and output T1 .

Figure 7: Subroutines B ERN -ROTATIONS and T ENSOR -B ERN -ROTATIONS for producing spiked Gaussian vectors
and tensors, respectively, from the planted bits distribution.

Lemma 8.1 (Dense Bernoulli Rotations). Let m and n be positive integers and let A ∈ Rm×n be a matrix
with singular values all at most λ > 0. Let RRK , 0 < q < p ≤ 1 and µ be as in Lemma 7.1. Let A
denote B ERN -ROTATIONS applied with rejection kernel parameter RRK , Bernoulli probability parameters
0 < q < p ≤ 1, output dimension m, matrix A with singular value upper bound λ and mean parameter µ.
Then A runs in poly(n, RRK ) time and it holds that

dTV A (PB(n, i, p, q)) , N µλ−1 · A·,i , Im = O n · R−3


 
RK

dTV A Bern(q)⊗n , N (0, Im ) = O n · R−3


  
RK

for all i ∈ [n], where A·,i denotes the ith column of A.


Proof. Let A1 denote the first step of A = B ERN -ROTATIONS with input V and output V1 , and let A2
denote the second step of A with input V1 and output V2 . Fix some index i ∈ [n]. Now Lemma 7.2 implies

dTV (A1 (PB(n, i, p, q)) , N (µ · ei , In )) = O n · R−3



RK (1)

51
where ei ∈ Rn is the ith canonical basis vector. Suppose that V1 ∼ N (µ · ei , In ) and let V1 = µ · ei + W
where W ∼ N (0, In ). Note that the entries of AW are jointly Gaussian and Cov(AW ) = AA> . Therefore,
we have that  
AV1 = µ · A·,i + AW ∼ N µ · A·,i , AA>
1/2
If U ∼ N (0, 1)⊗m is independent of W , then the entries of AW + λ2 · Im − ·AA> U are jointly
Gaussian. Furthermore, since both terms are mean zero and independent the covariance matrix of this
vector is given by
  1/2   1/2 
2 > 2 >
Cov AW + λ · Im − AA U = Cov (AW ) + Cov λ · Im − AA U

= AA> + (λ2 · Im − AA> ) = λ2 · Im


1/2
Therefore it follows that AW + λ2 · Im − AA> U ∼ N (0, λ2 · Im ) and furthermore that
 1/2
V2 = λ−1 · AV1 + Im − λ−2 · AA> U ∼ N µλ−1 · A·,i , Im


Where V2 ∼ A2 (N (µ · ei , In )). Now applying A2 to both distributions  in Equation (1) and


 the data-
processing inequality prove that dTV A (PB(n, i, p, q)) , N µλ−1 · A·,i , Im = O n · R−3 RK . This argu-
ment analyzing A2 applied with µ = 0 yields that A2 (N (0, In )) ∼ N (0, Im ). Combining this with
dTV A1 Bern(q)⊗n , N (0, In ) = O n · R−3
  
RK

from Lemma 7.2 now yields the bound dTV (A (Bern(q)⊗n ) , N (0, In )) = O n · R−3

RK , which completes
the proof of the lemma.

Corollary 8.2 (Tensor Bernoulli Rotations). Let s, m and n be positive integers, let A1 , A2 , . . . , As ∈ Rm×n
be matrices with singular values less than or equal to λ1 , λ2 , . . . , λs > 0, respectively. Let RRK , 0 < q <
p ≤ 1 and µ be as in Lemma 7.1. Let A denote T ENSOR -B ERN -ROTATIONS applied with parameters
0 < q < p ≤ 1, output dimension m, matrix A = A1 ⊗ A2 ⊗ · · · ⊗ As with singular value upper bound
λ = λ1 λ2 · · · λs and mean parameter µ. If s is a constant, then A runs in poly(n, RRK ) time and it holds
that for each e ∈ [n]s ,
dTV A (PBs (n, e, p, q)) , N µ(λ1 λ2 · · · λs )−1 · A·,e1 ⊗ A·,e2 ⊗ · · · ⊗ A·,es , Im
⊗s
= O ns · R−3
 
RK
  ⊗s
 
dTV A Bern(q)⊗n ⊗s
= O ns · R−3

, N 0, Im RK

where A·,i denotes the ith column of A.


Proof. Let σij for 1 ≤ i ≤ rj be the nonzero singular values of Aj for each 1 ≤ j ≤ s. Then the nonzero
singular values of the Kronecker product A = A1 ⊗ A2 ⊗ · · · ⊗ As are all of the products σi11 σi22 · · · σiss
for all (i1 , i2 , . . . , is ) with 1 ≤ ij ≤ rj for each 1 ≤ j ≤ s. Thus if σij ≤ λj for each 1 ≤ j ≤ s, then
λ = λ1 λ2 · · · λs is an upper bound on the singular values of A. The corollary now follows by applying
Lemma 8.1 with parameters p, q, µ and λ, matrix A, output dimension ms and input dimension ns .

8.2 Ftr Design Matrices


In this section, we introduce a family of matrices Kr,t that plays a key role in constructing the matrices A in
our applications of dense Bernoulli rotations. Throughout this section, r will denote a prime number and t
will denote a fixed positive integer. As outlined in the beginning of this section and in Section 4.3, there are
three desiderata of the matrices Kr,t that are needed for our applications of dense Bernoulli rotations. In the
context of Kr,t , these three properties are:

52
1. The rows of Kr,t are unit vectors and close to orthogonal in the sense that the largest singular value
of Kr,t is bounded above by a constant.

2. The matrices Kr,t both contain exactly two distinct real values as entries.

3. The matrices Kr,t contain a fraction of approximately 1/r negative entries per column.

The matrices Kr,t are constructed based on the incidence structure of the points in Ftr with the Grassmanian
of hyperplanes in Ftr and their affine shifts. The construction of Kr,t is motivated by the projective geometry
codes and their applications to constructing 2-block designs. We remark that a classic trick counting the
number of ordered d-tuples of linearly independent vectors in Ftr shows that the number of d-dimensional
subspaces of Ftr is
(rt − 1)(rt − r) · · · (rt − rd−1 )
|Gr(d, Ftr )| = d
(r − 1)(rd − r) · · · (rd − rd−1 )
−1 t
This implies that the number of hyperplanes in Ftr is ` = rr−1 . We now give the definition of the matrix
t
Kr,t as a weighted incidence matrix between the points of Fr and affine shifts of the hyperplanes in the
Grassmanian Gr(t − 1, Ftr ).

Definition 8.3 (Design Matrices Kr,t ). Let P1 , P2 , . . . , Prt be an enumeration of the points in Ftr and
t −1
V1 , V2 , . . . , V` , where ` = rr−1 , be an enumeration of the hyperplanes in Ftr . For each Vi , let ui 6= 0
t
denote a vector in Ftr not contained in Vi . Define Kr,t ∈ Rr`×r to be the matrix with the following entries

1 1 if Pj ∈6 Vi + aui
(Kr,t )r(i−1)+a+1,j = ·
− j ∈ Vi + aui
p
t
r (r − 1) 1 r if P

for each a ∈ {0, 1, . . . , r − 1} where Vi + v denotes the affine shift of Vi by v.

We now establish the key properties of Kr,t in the following simple lemma. Note that the lemma implies
that the submatrix consisting of the rows of Kr,t corresponding to hyperplanes in Ftr has rows that are exactly
orthogonal. However, the additional rows of Kr,t corresponding to affine shifts of these hyperplanes will
prove crucial in preserving tightness to algorithms in our average-case reductions. As established in the
subsequent lemma, these additional rows only mildly perturb the largest singular value of the matrix.

Lemma 8.4 (Sub-orthogonality of Kr,t ). If r ≥ 2 is prime, then Kr,t satisfies that:

1. for each 1 ≤ i ≤ kr`, it holds that k(Kr,t )i k2 = 1;

2. the inner product between the rows (Kr,t )i and (Kr,t )j where i 6= j are given by

−(r − 1)−1 if b(i − 1)/rc = b(j − 1)/rc



h(Kr,t )i , (Kr,t )j i =
0 otherwise

rt −1
3. each column of Kr,t contains exactly r−1 entries equal to √ 1−r
t
.
r (r−1)

Proof. Let ri denote the ith row (Kr,t )i of Kr,t . Fix a pair 1 ≤ i < j ≤ r` and let 1 ≤ i0 ≤ j 0 ≤ ` and
a, b ∈ {0, 1, . . . , r − 1} be such that i = r(i0 − 1) + a and j = r(j 0 − 1) + b. The affine subspaces of Ftr
corresponding to ri and rj are then Ai = Vi0 + aui0 and Aj = Vj 0 + buj 0 . Observe that

1 (1 − r)2
kri k22 = (rt − |Ai |) · + |A i | · =1
rt (r − 1) rt (r − 1)

53
Similarly, we have that

1 1−r (1 − r)2
hri , rj i = (rt − |Ai ∪ Aj |) · + (|A i ∪ A j | − |A i ∩ A j |) · + |A i ∩ A j | ·
rt (r − 1) rt (r − 1) rt (r − 1)
for each 1 ≤ i, j ≤ r`. Since the size of a subspace is invariant under affine shifts, we have that |Ai | =
|Vi0 | = |Aj | = |Vj 0 | = rt−1 . Furthermore, since Ai ∩ Aj is the intersection of two affine shifts of subspaces
of dimension t − 1 of Ftr , it follows that Ai ∩ Aj is either empty, an affine shift of a (t − 2)-dimensional
subspace or equal to both Ai and Aj . Note that if i 6= j, then Ai and Aj are distinct. We remark that when
t = 1, each Ai is an affine shift of the trivial hyperplane {0} and thus is a singleton. Now note that the
intersection Ai ∩ Aj is only empty if Ai and Aj are affine shifts of one another which occurs if and only if
b(i − 1)/rc = i0 = j 0 = b(j − 1)/rc. In this case, it follows that |Ai ∪ Aj | = |Ai | + |Aj | = 2rt−1 . In this
case, we have
1 1−r
hri , rj i = (rt − 2rt−1 ) · + 2rt−1 · t = −(r − 1)−1
rt (r − 1) r (r − 1)

If i0 6= j 0 , then Ai ∩ Aj is the affine shift of a (t − 2)-dimensional subspace which implies that |Ai ∩ Aj | =
rt−1 . Furthermore, |Ai ∪ Aj | = |Ai | + |Aj | − |Ai ∩ Aj | = 2rt−1 − rt−2 . In this case, we have that
1 1 r−1
hri , rj i = (r − 1)2 · − 2(r − 1) · 2 + 2 = 0
r2 (r − 1) r r
This completes the proof of (2). We remark that this last case never occurs if t = 1. Now note that any point
is in exactly one affine shift of each Vi . Therefore each column contains exactly ` negative entries, which
proves (3).

The next lemma uses the computation of h(Kr,t )i , (Kr,t )j i above to compute the singular values of Kr,t .
p
Lemma 8.5. The nonzero singular values of Kr,t are 1 + (r − 1)−1 with multiplicity (r − 1)`.
Proof. Lemma 8.4 shows that (Kr,t )(Kr,t )> is block-diagonal with ` blocks of dimension r × r. Further-
more, each block is of the form 1 + (r − 1)−1 Ir −(r −1)−1 11> . The eigenvalues of each of these blocks
are 1 + (r − 1)−1 with multiplicity r − 1 and 0 with multiplicity 1. Thus the eigenvalues of (Kr,t )(Kr,t )>
are 1 + (r − 1)−1 and 0 with multiplicities (r − 1)` and `, respectively, implying the result.

8.3 Ftr Design Tensors


(V ,V ,L)
In this section, we introduce a family of tensors Tr,t i j that will be used in T ENSOR -B ERN -ROTATIONS
in the matrix case with s = 2 to map to hidden partition models in Section 14.2. An overview of how these
tensors will be used in dense Bernoulli rotations was given in Section 4.3. Similar to the previous section,
(V ,V ,L)
the Tr,t i j are constructed to have the following properties:
(V ,Vj ,L)
1. Given a pair of hyperplanes (Vi , Vj ) and a linear function L : Fr → F r , the slice Tr,t i of the
(V ,V ,L)
constructed tensor is an rt × rt matrix with Frobenius norm Tr,t i j = 1.
F

2. These slices are approximately orthogonal in the sense that the Gram matrix with entries given by the
(V 0 ,Vj 0 ,L0 )
 
(V ,Vj ,L)
matrix inner products Tr Tr,t i · Tr,t i has a bounded spectral norm.

(V ,V ,L)
3. Each slice Tr,t i j contains two distinct entries and is an average signed adjacency matrix of a
hidden partition model i.e. has these two entries arranged into an r-block community structure.

54
(V ,V ,L)
4. Matrices formed by specific concatentations of Tr,t i j into larger matrices remain the average
signed adjacency matrices of hidden partition models. This will be made precise in Lemma 8.11 and
will be important in our reduction from k- PC.
(V ,V ,L)
The construction of the family of tensors Tr,t i j is another construction using the incidence geometry of
Ftr , but is more involved than the two constructions in the previous section. Throughout this section, we let
V1 , V2 , . . . , V` and P1 , P2 , . . . , Prt be an enumeration of the hyperplanes and points of Ftr as in Definition
8.3. Furthermore, for each Vi , we fix a particular point ui 6= 0 of Ftr not contained in Vi . In order to introduce
(V ,V ,L)
the family Tr,t i j , we first define the following important class of bipartite graphs.
Definition 8.6 (Affine Block Graphs Gr,t ). For each 1 ≤ i ≤ `, let Ai0 ∪ Ai1 ∪ · · · ∪ Air−1 be the partition
of Ftr given by the affine shifts Aix = (Vi + xui ) for each x ∈ Fr . Given two hyperplanes Vi , Vj and linear
function L : Fr → Fr , define the bipartite graph Gr,t (Vi , Vj , L) with two parts of size rt , each indexed by
points in Ftr , as follows:
1. all of the edges between the points with indices in Aix in the left part of Gr,t (Vi , Vj , L) and the points
with indices in Ajy in the right part are present if L(x) = y; and
2. none of the edges between the points of Aix on the left and Ajy on the right are present if L(x) 6= y.
We now define the slices of the tensor Tr,t to be weighted adjacency matrices of the bipartite graphs
Gr,t (Vi , Vj , L) as in the following definition.
Definition 8.7 (Design Tensors Tr,t ). For any two hyperplanes Vi , Vj and linear function L : Fr → Fr ,
(V ,V ,L)
define the rt × rt matrix Tr,t i j to have entries given by


(V ,V ,L)
 1 r − 1 if (Pk , Pl ) ∈ E (Gr,t (Vi , Vj , L))
Tr,t i j = t√ ·
k,l r r−1 −1 otherwise

for each 1 ≤ k, l ≤ rt .
The next two lemmas establish that the tensor Tr,t satisfies the four desiderata discussed above, which
will be crucial in our reduction to hidden partition models.
Lemma 8.8 (Sub-orthogonality of Tr,t ). If r ≥ 2 is prime, then Tr,t satisfies that:

t (Vi ,Vj ,L)
1. for each 1 ≤ i, j ≤ r and linear function L, it holds that Tr,t = 1;
F

(V ,V ,L) (V 0 ,V 0 ,L0 )
2. the inner product between the slices Tr,t i j and Tr,t i j where (Vi , Vj , L) 6= (Vi0 , Vj 0 , L0 ) is
 (Vi0 ,Vj 0 ,L0 )
  −(r − 1)−1 if (V , V ) = (V 0 , V 0 ) and L = L0 + a for some a 6= 0
(Vi ,Vj ,L) i j i j
Tr Tr,t · Tr,t =
0 if (Vi , Vj ) 6= (Vi0 , Vj 0 ) or L 6= L0 + a for all a ∈ Fr

Proof. Fix two triples (Vi , Vj , L) and (Vi0 , Vj 0 , L0 ) and let G1 = Gr,t (Vi , Vj , L) and G2 = Gr,t (Vi0 , Vj 0 , L0 ).
Now observe that

(V ,V ,L) (V 0 ,V 0 ,L0 )
 1
Tr Tr,t i j · Tr,t i j = 2t · (r − 1)2 · |E(G1 ) ∩ E(G2 )|
r (r − 1)
1
− 2t · (r − 1) · (|E(G1 ) ∪ E(G2 )| − |E(G1 ) ∩ E(G2 )|)
r (r − 1)
1
· r2t − |E(G1 ) ∪ E(G2 )|

+ 2t (2)
r (r − 1)

55
Now note that since L is a function, there are exactly r pairs (x, y) ∈ F2r such that L(x) = y and thus
exactly r pairs of left and right sets (Aix , Ajy ) that are completely connected by edges in G1 . This implies
that there are |E(G1 )| = |E(G2 )| = r2t−1 edges in both G1 and G2 . We now will show that

if (Vi , Vj , L) = (Vi0 , Vj 0 , L0 )
 2t−1
 r
|E(G1 ) ∩ E(G2 )| = r2t−2 if (Vi , Vj ) 6= (Vi0 , Vj 0 ) or L 6= L0 + a for all a ∈ Fr (3)
if (Vi , Vj ) = (Vi0 , Vj 0 ) and L = L0 + a for some a 6= 0

0

We remark that, as in the proof of Lemma 8.4, it is never true that (Vi , Vj ) 6= (Vi0 , Vj 0 ) if t = 1. The first
case follows immediately from the fact that |E(G1 )| = r2t−1 . Now consider the case in which Vi 6= Vi0 and
0
Vj 6= Vj 0 . As in the proof of Lemma 8.4, any pair of affine spaces Aix and Aix0 either intersects in an affine
0
space of dimension t − 2, an affine space of dimension t − 1 if Aix = Aix0 are equal and in the empty set if
0
Aix and Aix0 are affine shifts of one another. Since Vi 6= Vi0 , only the first of these three options is possible.
0 0 0 0
Therefore, for all x, x0 , y, y 0 ∈ Fr , it follows that (Aix × Ajy ) ∩ (Aix0 × Ajy0 ) = (Aix ∩ Aix0 ) × (Ajy × Ayj 0 ) has
0 0
size r2t−4 since both Aix ∩ Aix0 and Ajy × Ajy0 are affine spaces of dimension t − 2. Now observe that
  0 j0
X X 
i
|E(G1 ) ∩ E(G2 )| = Ax × Ajy ∩ Aix0 × Ay0 = r2 · r2t−4 = r2t−2
L(x)=y L0 (x0 )=y 0

since there are exactly r pairs (x, y) with L(x) = y. Now suppose that Vi = Vi0 and Vj 6= Vj 0 . In this case,
0
we have that Aix ∩ Aix0 is empty if x 6= x0 and otherwise has size |Aix | = rt−1 . Thus it follows that
  r2t−3 if x = x0
i j
  i0 j0
A
x × A y ∩ A x0 × Ay0 = 0 otherwise

This implies that


  0 j0
X X 
i
|E(G1 ) ∩ E(G2 )| = Ax × Ajy ∩ Aix0 × Ay0 = r · r2t−3 = r2t−2
L(x)=y L0 (x0 )=y 0

since for each fixed x = x0 , there is a unique pair (y, y 0 ) with L(x) = y and L(x0 ) = y 0 . The case in which
Vi 6= Vi0 and Vj = Vj 0 is handled by a symmetric argument. Now suppose that (Vi , Vj ) = (Vi0 , Vj 0 ). It
0 0
follows that (Aix × Ajy ) ∩ (Aix0 × Ajy0 ) has size r2t−2 if x = x0 and y = y 0 , and is empty otherwise. The
formula above therefore implies that |E(G1 ) ∩ E(G2 )| is r2t−2 times the number of solutions to L(x) =
L0 (x). Since L − L0 is linear, the number of solutions is 0 if L − L0 is constant and not equal to zero,
1 if L − L0 is not constant or r if L = L0 . This completes the proof of Equation (3). Now observe that
|E(G1 ) ∪ E(G2 )| = |E(G1 )| + |E(G2 )| − |E(G1 ) ∩ E(G2 )| = 2r2t−1 − |E(G1 ) ∩ E(G2 )|. Substituting
this expression for |E(G1 ) ∪ E(G2 )| into Equation (2) yields that

(V ,V ,L) (V 0 ,V 0 ,L0 )
 r2 1
Tr Tr,t i j · Tr,t i j = 2t
· |E(G1 ) ∩ E(G2 )| −
r (r − 1) r−1

Combining this with the different cases of Equation (3) shows part (2) of the lemma. Part (1) of the lemma
follows from this computation and fact that
  
(Vi ,Vj ,L) 2 (Vi ,Vj ,L) 2

Tr,t = Tr Tr,t
F

This completes the proof of the lemma.

56
We now define an unfolded matrix variant of the tensor Tr,t that will be used in our applications of
T ENSOR -B ERN -ROTATIONS to map to hidden partition models. The row indexing in Mr,t will be important
and related to the community alignment property of Tr,t that will be established in Lemma 8.11.

Definition 8.9 (Unfolded Matrix Mr,t ). Let Mr,t be an (r − 1)2 `2 × r2t matrix with entries given by
 (V ,V ,L
a+1,b+1 )

0 0
(Mr,t )a(r−1)`2 +i0 (r−1)`+b`+j 0 +1,irt +j+1 = Tr,t i +1 j +1
i,j

for each 0 ≤ i0 , j 0 ≤ (r − 1)` − 1, 0 ≤ a, b ≤ r − 2 and 0 ≤ i, j ≤ rt − 1, where Lc,d : Fr → Fr denotes


the linear function given by Lc,d (x) = cx + d.

The next lemma is similar to Lemma 8.5 and deduces the singular values of Mr,t from Lemma 8.8. The
proof is very similar to that of Lemma 8.5.
p
Lemma 8.10 (Singular Values of Mr,t ). The nonzero singular values of Mr,t are 1 + (r − 1)−1 with
multiplicity (r − 1)(r − 2)`2 and (r − 1)−1/2 with multiplicity (r − 1)`2 .

Proof. Observe that the rows of Mr,t are formed by vectorizing the slices of Tr,t . Thus Lemma 8.8 implies
that (Mr,t )(Mr,t )> is block-diagonal with (r − 1)`2 blocks of dimension (r − 1) × (r − 1), where each block
corresponds to slices with indices (Vi , Vj , Lc,d ) where i, j and c are fixed on over
 each block while d ranges
over {1, 2, . . . , r − 1}. Furthermore, each block is of the form 1 + (r − 1)−1 Ir−1 − (r − 1)−1 11> . The
eigenvalues of each of these blocks are 1 + (r − 1)−1 with multiplicity r − 2 and (r − 1)−1 with multiplicity
1. Thus the eigenvalues of (Mr,t )(Mr,t )> are 1+(r −1)−1 and (r −1)−1 with multiplicities (r −1)(r −2)`2
and (r − 1)`2 , respectively, which implies the result.

Given m2 matrices M 1,1 , M 1,2 , . . . , M k,k ∈ Rn×n , let C M 1,1 , M 1,2 , . . . , M k,k denote the matrix


X ∈ Rkn×kn formed by concatenating the M i,j with


a+1,c+1
Xan+b+1,cn+d+1 = Mb+1,d+1 for all 0 ≤ a, c ≤ k − 1 and 0 ≤ b, d ≤ n − 1

We refer to a matrix M ∈ Rn×n as a k-block matrix for some k that divides n if there are two values
x1 , x2 ∈ R and two partitions [n] = E1 ∪ E2 ∪ · · · ∪ Ek = F1 ∪ F2 ∪ · · · ∪ Fk both into parts of size n/k
such that 
x1 if (i, j) ∈ Eh × Fh for some 1 ≤ h ≤ k
Mij =
x2 otherwise
The next lemma shows an alignment property of different slices of Tr,t that will be crucial in stitching
together the local applications of T ENSOR -B ERN -ROTATIONS with Mr,t in our reduction to hidden partition
models. This lemma will use indexing the in Mr,t and the role of linear functions L in defining the affine
block graphs Gr,t .

Lemma 8.11 (Community Alignment in Tr,t ). Let 1 ≤ s1 , s2 , . . . , sk ≤ (r − 1)` be arbitrary indices and
(V 0 ,Vj 0 ,L)
M i,j = Tr,t i for each 1 ≤ i, j ≤ k

where i0 and j 0 are the unique 1 ≤ i0 , j 0 ≤ ` such that i0 ≡ si (mod `) and j 0 ≡ sj (mod `) and
L(x) = ax + b where a = dsi /`e and b = dsj /`e. Then it follows that C M 1,1 , M 1,2 , . . . , M k,k is an
r-block matrix.

57
Proof. Let ti = i0 be the unique 1 ≤ i0 ≤ ` such that i0 ≡ si (mod `) and let ai = dsi /`e ∈ {1, 2, . . . , r−1}
for each 1 ≤ i ≤ `. Furthermore, let Lij (x) = ai x + aj for 1 ≤ i, j ≤ k and, for each x ∈ R and 1 ≤ i ≤ `,
let Aix be the affine spaces as in Definition 8.6. Note that since 0 < ai < r, it follows that each Lij is
a non-constant and hence invertible linear function. Given a subset S ⊆ Ftr and some s ∈ N, let I(s, S)
denote the set of indices I(s, S) = {s + i : Pi ∈ S}.
Now define the partition [krt ] = E0 ∪ E2 ∪ · · · ∪ Er−1 as follows
k  
t
[
Ei = I (j − 1)rt , Axjij where xij = L−1
j1 (L11 (i))
j=1

and similarly define the partition [krt ] = F0 ∪ F2 ∪ · · · ∪ Fr−1 as follows


k  
t
[
Fi = I (j − 1)rt , Ayjij where yij = L1j (i)
j=1

t ×kr t
Let X ∈ Rkr denote the matrix X = C M 1,1 , M 1,2 , . . . , M k,k . We will show that


r−1
Xa,b = √ if (a, b) ∈ Ei × Fi for some 0 ≤ i ≤ r − 1 (4)
rt r−1

Suppose that (a, b) ∈ Ei × Fi and specifically that (ja − 1)rt + 1 ≤ a ≤ ja rt and (jb − 1)rt + 1 ≤ b ≤ jb rt
t
for some 1 ≤ ja , jb ≤ k. The definitions of Ei and Fi imply that za ∈ Axjija a where za = Pa−(ja −1)rt and
tj
zb ∈ Ayijb b where zb = Pb−(jb −1)rt . Note that
ja ,jb
Xa,b = Ma−(j t
a −1)r ,b−(jb −1)r
t

by the definition of C. Therefore by Definition 8.7, it suffices to show that (za , zb ) is an edge of the bipartite
graph Gr,t (Vtja , Vtjb , Lja jb ) for all such (a, b) to establish (4). By Definition 8.6, (za , zb ) is an edge if and
only if Lja jb (xija ) = yijb . Observe that the definitions of xija and yijb yield that

aja xija + a1 = Lja 1 (xija ) = L11 (i) = a1 · i + a1 (5)


yijb = L1jb (i) = a1 · i + ajb
Lja jb (x) = aja x + ajb

Adding ajb − a1 to both sides of Equation (5) therefore yields that

Lja jb (xija ) = aja xija + ajb = a1 · i + ajb = yijb

which√completes the proof of (4). Now note that each M i,j contains exactly r2t−1 entries equal to (r −
1)/rt r − 1 and thus X contains exactly k 2 r2t−1 such entries. The definitions of Ei and Fi imply that
they each contain exactly kr√t−1 elements. Thus ∪r−1 2 2t−1 elements. Therefore (4) also
i=0 Ei × Fi contains k r
t r−1
implies that Xa,b = −1/r r − 1 for all (a, b) 6∈ ∪i=0 Ei × Fi . This proves that X is an r-block matrix
and completes the proof of the lemma.

The community alignment property shown in this lemma is directly related to the indexing of rows in
Mr,t . More precisely, the above lemma implies that for any subset S ⊆ [(r − 1)`], the rows of Mr,t indexed
by elements in the support of 1S ⊗ 1S can be arranged as sub-matrices of an |S|rt × |S|rt matrix that is
an r-block matrix. This property will be crucial in our reduction from k- PC and k- PDS to hidden partition
models in Section 14.2.

58
8.4 A Random Matrix Alternative to Kr,t
In this section, we introduce the random matrix analogue Rn, of Kr,t defined below. Rather than have all
independent entries, Rn, is constrained to be symmetric. This ends up being technically convenient, as it
suffices to bound the eigenvalues of Rn, in order to upper bound its largest singular value. This symmetry
also yields a direct connection between the eigenvalues of Rn, and the eigenvalues of sparse random graphs,
which have been studied extensively.

Definition 8.12 (Random Design Matrix Rn, ). If  ∈ (0, 1/2], let Rn, ∈ Rn×n denote the random sym-
metric matrix with independent entries sampled as follows
 q
 − 1− with prob. 
(Rn, )ij = (Rn, )ji ∼ q n

 with prob. 1 − 
(1−)n

q

for all 1 ≤ i < j ≤ n, and (Rn, )ii = (1−)n for each 1 ≤ i ≤ n.

We now establish the key properties of the matrix Rn, . Consider the graph G where {i, j} ∈ E(G)
if and only if (Rn, )ij is negative. By definition, we have that G is an -sparse Erdős-Rényi graph with
G ∼ G(n, ). Furthermore, if A is the adjacency matrix of G, a direct calculation yields that Rn, can be
expressed as r
 1
Rn, = · In + p · (E[A] − A) (6)
(1 − )n (1 − )n
A line of work has given high probability upper bounds on the largest eigenvalue of E[A]−A in order to study
concentration of sparse Erdős-Rényi graphs in the spectral norm of their adjacency matrices [FK81, Vu05,
FO05, LP13, BVH+ 16, LLV17]. As outlined in [LLV17], the works [FK81, Vu05, LP13] apply Wigner’s
trace method to obtain spectral concentration results for general random matrices that, in this context, imply
with high probability that

kE[A] − Ak = 2 d (1 + on (1)) for d  (log n)4

where d = n and k·k denotes the spectral norm on n×n symmetric matrices. In [FO05, BVH+ 16, LLV17],
it is shown that this requirement on d can be relaxed and that it holds with high probability that

kE[A] − Ak = On ( d) for d = Ωn (log n)

These results, the fact that Rn, is symmetric and the above expression for Rn, in terms of A are enough to
establish our main desired properties of Rn, , which are stated formally in the following lemma.

Lemma 8.13 (Key Properties of Rn, ). If  ∈ (0, 1/2] satisfies that n = ωn (log n), there is a constant
C > 0 such that the random matrix Rn, satisfies the following two conditions with probability 1 − on (1):

1. the largest singular value of Rn, is at most C; and


√ √
2. every column of Rn, contains between n − C n log n and n + C n log n negative entries.

Proof. The number of negative entries in the ith column of Rn, is distributed as Bin(n − 1, ). A standard
Chernoff bound for the binomial distribution yields that if X ∼ Bin(n − 1, ), then
 2 
δ (n − 1)
P [|X − (n − 1)| ≥ δ(n − 1)] ≤ 2 exp −
3

59
p
for all δ ∈ (0, 1). Setting δ = C 0 n−1 −1 log n for a sufficiently large constant C 0 > 0 and taking a
union bound over all columns i implies that property (2) in the lemma statement holds with probability
1 − on (1). We now apply Theorem 1.1 in [LLV17] as in the first example in Section 1.4, where the graph is
not modified. Since n = ωn (log n), this yields with probability 1 − on (1) that

kE[A] − Ak ≤ C 00 d

for some constant C 00 > 0, where A and d are as defined above. The decomposition of Rn, in Equation (6)
now implies that with probability 1 − on (1)
r
 1 √
kRn, k ≤ +p · C 00 d = On (1)
(1 − )n (1 − )n

since  ∈ (0, 1/2] and d = n. This establishes that property (1) holds with probability 1 − on (1). A union
bound over (1) and (2) now completes the proof of the lemma.

While Rn, and Kr,t satisfy similar conditions needed by our reductions, they also have differences that
will dictate when one is used over the other. The following highlights several key points in comparing these
two matrices.
• Rn, and Kr,t p t
are analogous whenp n = r and  = 1/r. In this case, both matrices contain the same
two values 1/ r (r − 1) and − (r − 1)/rt . The rows of Kr,t are unit vectors and the rows of Rn,
t

are approximately
p unit vectors – property (2) in Lemma 8.13 implies that the norm of each row is
−1
1 ± On ( (n) log n). Like Kr,t , Lemma 8.13 implies that Rn, is also approximately orthogonal
with largest singular value bounded above by a constant.

• While Kr,t has exactly a (1/r)-fraction of entries in each column that are negative, Rn, only has
approximately an -fraction of entries in each of its columns that are negative. For some of our
reductions, such as our reductions to RSME and RSLR, having approximately an -fraction of its entries
equal to the negative value in Definition 8.12 is sufficient. In our reductions to ISBM, GHPM, BHPM
and SEMI - CR, it will be important that Kr,t contains exactly (1/r)-fraction of negative entries per
column. The approximate guarantee of Rn, would correspond to only showing lower bounds against
algorithms that are adaptive and do not need to know the sizes of the hidden communities.

• As is mentioned in Section 3.2 and will be discussed in Section 13, our applications of dense Bernoulli
rotations with Kr,t will generally be tight when a natural parameter n in our problems satisfies that

n = Θ̃(rt ). This imposes a number theoretic condition (T) on the pair (n, r), arising from the fact
that t must be an integer, which generally remains a condition in the computational lower bounds we
show for ISBM, GHPM and BHPM. In contrast, Rn, places no number-theoretic constraints on n and
, which can be arbitrary, and thus the condition (T) can be removed from our computational lower
bounds for RSME and RSLR. We remark that when r = no(1) , which often is the regime of interest in
problems such as RSME, then the condition ( T ) is trivial and places no further constraints on (n, r) as
will be shown in Lemma 13.2.

• Rn, is random while Kr,t is fixed. In our reductions, it is often important that the same design
matrix is used throughout multiple applications of dense Bernoulli rotations. Since Rn, is a random
matrix, this requires generating a single instance of Rn, and using this one instance throughout our
reductions. In each of our reductions, we will rejection sample Rn, until it satisfies the two criteria
in Lemma 8.13 for a maximum of O((log n)2 ) rounds, and then use the resulting matrix throughout
all applications of dense Bernoulli rotations in that reduction. The probability bounds in Lemma 8.13
imply that the probability no sample from Rn, satisfying these criteria is found is n−ωn (1) . This is

60
a failure mode for our reductions and contributes a negligible n−ωn (1) to the total variation distance
between the output of our reductions and their target distributions.
• For some of our reductions, applying dense Bernoulli rotations with either of the two matrices Rn, or
Kr,t yields the same guarantees. This is the case for our reductions to MSLR, GLSM and TPCA, where
r = 2 and the condition (T) is trivial and mapping to columns with approximately half of their entries
negative is sufficient. As mentioned above, this is also the case when r  −1 = no(1) in RSME.
• Some differences between Rn, and Kr,t that are unimportant for our reductions include that Rn, is
exactly square while Kr,t is only approximately square and that Rn, is symmetric while Kr,t is not.
For consistency, the pseudocode and analysis for all of our reductions are written with Kr,t rather than Rn, .
Modifying our reductions to use Rn, is straightforward and consists of adding the rejection sampling step
to sample Rn, discussed above. In Sections 10.1, 10.2 and 13, we discuss in more detail how to make these
modifications to our reductions to RSME and RSLR and the computational lower bounds they yield.
There are several reasons why we present our reductions with Kr,t rather than Rn, . The analysis of Kr,t
in Section 8.2 is simple and self-contained while the analysis of Rn, requires fairly involved results from
random matrix theory. The construction of Kr,t naturally extends to Tr,t while a random tensor analogue
Tr,t seems as though it would be prohibitively difficult to analyze. Reductions with Kr,t give an explicit
encoding of cliques into the hidden structure of our target problems as discussed in Section 4.8, yielding
slightly stronger and more explicit computational lower bounds in this sense.

9 Negatively Correlated Sparse PCA


This section is devoted to giving a reduction from bipartite planted dense subgraph to negatively correlated
sparse PCA, the high level of which was outlined in Section 4.5. This reduction will be used in the next
section as a crucial subroutine in reductions to establish conjectured statistical-computational gaps for two
supervised problems: mixtures of sparse linear regressions and robust sparse linear regression. The analysis
of this reduction relies on a result on the convergence of the Wishart distribution and its inverse. This result
is proven in the second half of this section.

9.1 Reducing to Negative Sparse PCA


In this section, we give our reduction BPDS - TO - NEG - SPCA from bipartite planted dense subgraph to nega-
tively correlated sparse PCA, which is shown in Figure 8. This reduction is described with the input bipartite
graph as its adjacency matrix of Bernoulli random variables. A key subroutine in this reduction is the proce-
dure χ2 -R ANDOM -ROTATION from [BB19b], which is also shown in Figure 8. The lemma below provides
total variation guarantees for χ2 -R ANDOM -ROTATION and is adapted from Lemma 4.6 from [BB19b] to be
in our notation and apply to the generalized case where the input matrix M is rectangular instead of square.
This lemma can be proven with an identical argument to that in Lemma 4.6 from [BB19b], with the
following adjustment of parameters to the rectangular case. The first two steps of χ2 -R ANDOM -ROTATION
maps M[m]×[n] (S × T, p, q) approximately to
r
τ kn
· 1S u>
T + N (0, 1)
⊗m×n
2 n
where uT is the vector with (uT )i = ri if i ∈ T and (uT )i = 0 otherwise. The argument in Lemma 4.6
from [BB19b] shows that the final step of χ2 -R ANDOM -ROTATION maps this distribution approximately to
r
τ kn
· 1S w> + N (0, 1)⊗m×n
2 n

61
Algorithm χ2 -R ANDOM -ROTATION
Inputs: Matrix M ∈ {0, 1}m×n , Bernoulli probabilities 0 < q < p ≤ 1, planted subset size kn that
divides n and a parameter τ > 0
p n p o
1. Sample r1 , r2 , . . . , rn ∼i.i.d. χ2 (n/kn ) and truncate the rj with rj ← min rj , 2 n/kn for
each j ∈ [n].

2. Compute M 0 by applying G AUSSIANIZE to M with Bernoulli probabilities p p and q, rejection


kernel parameter RRK = mn, parameter τ and target mean values µij = 21 τ · rj · kn /n for each
i ∈ [m] and j ∈ [n].

3. Sample an orthogonal matrix R ∈ Rn×n from the Haar measure on the orthogonal group On and
output the columns of the matrix M 0 R.

Algorithm BPDS - TO - NEG - SPCA


Inputs: Matrix M ∈ {0, 1}m×n , Bernoulli probabilities 0 < q < p ≤ 1, planted subset size kn that
divides n and a parameter τ > 0, target dimension d ≥ m

1. Compute X = (X1 , X2 , . . . , Xn ) where Xi ∈ Rm as the columns of the matrix output by


χ2 -R ANDOM -ROTATION applied to M with parameters p, q, kn and τ .

2. Compute Σ̂ = ni=1 Xi Xi> and let R ∈ Rm×n be the top m rows of an orthogonal matrix sampled
P
from the Haar measure on the orthogonal group On and compute the matrix

M 0 = n(n − m − 1) · Σ̂−1/2 R
p

where Σ̂−1/2 is the positive semidefinite square root of the inverse of Σ̂.

3. Output the columns of the d × n matrix with upper left m × n submatrix M 0 and all remaining
entries sampled i.i.d. from N (0, 1).

Figure 8: Subroutine χ2 -R ANDOM -ROTATION for random rotations to instances of sparse PCA from [BB19b] and
our reduction from bipartite planted dense subgraph to negative sparse PCA.

where w ∼ N (0, In ). Now observe that the entries of this matrix are zero mean and jointly Gaussian.
2
n |S|
Furthermore, the columns are independent and have covariance matrix Im + τ k4n · vS vS> where vS =
|S| −1/2 · 1S . Summarizing the result of this argument, we have the following lemma.

Lemma 9.1 (χ2 Random Rotations – Adapted from Lemma 4.6 in [BB19b]). Given parameters m, n, let
0 < q < p ≤ 1 be such that p − q = (mn)−O(1) and min(q, 1 − q) = Ω(1), let kn ≤ n be such that kn
divides n and let τ > 0 be such that
    
δ p 1−q
τ≤ p where δ = min log , log
2 6 log(mn) + 2 log(p − q)−1 q 1−p

62
The algorithm A = χ2 -R ANDOM -ROTATION runs in poly(m, n) time and satisfies that
⊗n !
τ 2 kn |S|

>
≤ O (mn)−1 + kn (4e−3 )n/2kn
 
dTV A M[m]×[n] (S × T, p, q) , N 0, Im + · vS vS
4n
dTV A Bern(q)⊗m×n , N (0, 1)⊗m×n = O (mn)−1
  

where vS = √1 · 1S ∈ Rm for all subsets S ⊆ [m] and T ⊆ [n] with |T | = kn .


|S|

Throughout the remainder of this section, we will need to use properties of the Wishart and inverse
Wishart distributions. These distributions on random matrices are defined as follows.
Definition 9.2 (Wishart Distribution). Let n and d be positive integers and Σ ∈ Rd×d bePa positive semidef-
inite matrix. The Wishart distribution Wd (n, Σ) is the distribution of the matrix Σ̂ = ni=1 Xi Xi> where
X1 , X2 , . . . , Xn ∼i.i.d. N (0, Σ).
Definition 9.3 (Inverse Wishart Distribution). Let n, d and Σ be as in Definition 9.2. The inverted Wishart
distribution Wd−1 (n, Σ) is the distribution of Σ̂−1 where Σ̂ ∼ Wd (n, Σ).
In order to analyze BPDS - TO - NEG - SPCA, we also will need the following observation from [BB19b].
This is a simple consequence of the fact that the distribution N (0, In ) is isotropic and thus invariant under
multiplication by elements of the orthogonal group On .
Lemma 9.4 (Lemma 6.5 in [BB19b]). Suppose that n ≥ d and let Σ ∈ Rd×d be a fixed positive definite
matrix and let Σe ∼ Wd (n, Σ). Let R ∈ Rd×n be the matrix consisting of the first d rows of an n × n matrix
chosen randomly and independently of Σe from the Haar measure µOn on On . Let (Y1 , Y2 , . . . , Yn ) be the
1/2
n columns of Σe R, then Y1 , Y2 , . . . , Yn ∼i.i.d. N (0, Σ).
We now will state and prove the main total variation guarantees for BPDS - TO - NEG - SPCA in the theorem
below. The proof of the theorem below crucially relies on the upper bound in Theorem 9.6 on the KL
divergence between Wishart matrices and their inverses. Proving this KL divergence bound is the focus of
the next subsection.
Theorem 9.5 (Reduction to Negative Sparse PCA). Let m, n, p, q, kn and τ be as in Lemma 9.1 and suppose
that d ≥ m and n  m3 as n → ∞. Fix any subset S ⊆ [m] and let θS be given by

τ 2 kn |S|
θS =
4n + τ 2 kn |S|
Then algorithm A = BPDS - TO - NEG - SPCA runs in poly(m, n) time and satisfies that
  ⊗n   
>
≤ O m3/2 n−1/2 + kn (4e−3 )n/2kn

dTV A M[m]×[n] (S × T, p, q) , N 0, Id − θS vS vS
   
dTV A Bern(q)⊗m×n , N (0, 1)⊗d×n = O m3/2 n−1/2


where vS = √1 · 1S ∈ Rd for all subsets S ⊆ [m] and T ⊆ [n] with |T | = kn .


|S|

Proof. Let A1 denote the application of χ2 -R ANDOM -ROTATION with input M and output X in Step 1 of
A. Let A2a denote the Markov transition with input X and output n(n − m − 1) · Σ̂−1 , as defined in Step 2
of A, and let A2b-3 denote the Markov transition with input Y = n(n − m − 1) · Σ̂−1 and output Z formed
by padding Y 1/2 R with i.i.d. N (0, 1) random variables to be d × n i.e. the output of A. Furthermore, let
A2-3 = A2b-3 ◦ A2a denote Steps 2 and 3 with input X and output Z.

63
Pn >
Now fix some positive semidefinite matrix Σ ∈ Rm×m and observe that if A = i=1 Zi Zi ∼
Wm (n, Im ) where Z1 , Z2 , . . . , Zn ∼i.i.d. N (0, Im ), then it also follows that
Xn   >
1/2 1/2
Σ AΣ = Σ1/2 Zi Σ1/2 Zi ∼ Wm (n, Σ)
i=1

since Σ1/2 Z ∼ N (0, Σ). Now observe that (Σ1/2 AΣ1/2 )−1 = Σ−1/2 A−1 Σ−1/2 and thus if B ∼ Wm
i
−1 (n, I )
m
then Σ−1/2 BΣ−1/2 ∼ Wm −1 (n, Σ). Let β −1 = n(n − m − 1) and C ∼ W −1 (n, β · I ). Therefore we have
m m
by the data processing inequality for total variation in Fact 6.2 that
    
−1
n, β · Σ−1 = dTV L Σ1/2 AΣ1/2 , L Σ1/2 CΣ1/2

dTV Wm (n, Σ), Wm
≤ dTV (L (A) , L (C))
r
1 
−1

≤ · dKL Wm (n, Im ) Wm (n, β · Im )
2
 
= O m3/2 n−1/2

where the last inequality follows from the fact that n  m3 , Theorem 9.6 and Pinsker’s inequality.
⊗n 2
n |S|
Suppose that X ∼ N 0, Im + θS0 vS vS> where θS0 = τ k4n . Then we have that the output Y of A2a
−1 −1 −1

satisfies Y = n(n − m − 1) · Σ̂ ∼ Wm n, β · Σ where
 −1 θS0
Σ = Im + θS0 vS vS> = Im − · vS> vS> = Im − θS vS vS>
1 + θS0
Therefore it follows from the inequality above that
   ⊗n     
0 > >
dTV A2a N 0, Im + θS vS vS , Wm n, Im − θS vS vS = O m3/2 n−1/2

Similarly, if X ∼ N (0, Im )⊗n then we have that


 
dTV A2a N (0, Im )⊗n , Wm (n, Im ) = O m3/2 n−1/2
 

applying the same argument with Σ = Im . Now note that if Y ∼ Wm n, Im − θS vS vS> then Lemma 9.4

⊗n
implies that A2b-3 produces Z ∼ N 0, Id − θS vS vS> . Similarly, it follows that if Y ∼ Wm (n, Im ) then
⊗n
Lemma 9.4 implies that Z ∼ N (0, Id ) .
We now will use Lemma 6.3 applied to the steps Ai above and the following sequence of distributions
P0 = M[m]×[n] (S × T, p, q)
 ⊗n
P1 = N 0, Im + θS0 vS vS>
 
P2a = Wm n, Im − θS vS vS>
 ⊗n
P2b-3 = N 0, Id − θS vS vS>

As in the statement of Lemma 6.3, let i be any real numbers satisfying dTV (Ai (Pi−1 ), Pi ) ≤ i for each
step i. A direct application of Lemma 9.1, shows that we can take 1 = O(m−1 n−1 ) + k(4e−3 )n/2k . The
arguments above show we can take 2a = O(m3/2 n−1/2 ) and 2b-3 = 0. Lemma 6.3 now implies the first
bound in the theorem statement. The second bound follows from an analogous argument for the distributions
P0 = Bern(q)⊗m×n , P1 = N (0, Im )⊗n , P2a = Wm (n, Im ) and P2b-3 = N (0, Id )⊗n
with 1 = O(m−1 n−1 ), 2a = O(m3/2 n−1/2 ) and 2b-3 = 0. This completes the proof of the theorem.

64
9.2 Comparing Wishart and Inverse Wishart
This section is devoted to proving the upper bound on the KL divergence between Wishart matrices and
their inverses in Theorem 9.6 used in the proof of Theorem 9.5. As noted in the previous subsection, the
next theorem also implies total variation convergence between Wishart and inverse Wishart when n  d3 by
Pinsker’s inequality. This theorem is related to a line of recent research examining the total variation conver-
gence between ensembles of random matrices in the regime where n  d. A number of recent papers have
investigated the total variation convergence between the fluctuations of the Wishart and Gaussian orthog-
onal ensembles, also showing these converge when n  d3 [JL15, BDER16, BG16, RR19], convergence
with other matrix ensembles at intermediate asymptotic scales of d  n  d3 [CW+ 19] and applications
of these results to random geometric graphs [BDER16, EM16, BBN19].
Let Γd (x) and ψd (x) denote the multivariate gamma and digamma functions given by
d   d  
d(d−1)/4
Y i−1 ∂ log Γd (a) X i−1
Γd (a) = π · Γ a− and ψd (a) = = ψ a−
2 ∂a 2
i=1 i=1

where Γ(z) and ψ = Γ0 (z)/Γ(z) denote the ordinary gamma and digamma functions. We will need several
approximations to the log-gamma and digamma functions to prove our desired bound on KL divergence.
The classical Stirling series for the log-gamma function is
  ∞
1 1 X B2k
log Γ(z) ∼ log(2π) + z − log z − z +
2 2 2k(2k − 1)z 2k−1
k=1

where Bm denotes the mth Bernoulli number. While this series does not converge absolutely for any z
because of the growth rate of the coefficients B2k , its partial sums are increasingly accurate. More precisely,
we have the following series approximation to the log-gamma function (see e.g. pg. 67 of [Rem13]) up to
second order  
1 1 1
log Γ(z) = log(2π) + z − log z − z + + O(z −3 )
2 2 12z
as z → ∞. A similar series expansion exists for the digamma function, given by

1 X B2k
ψ(z) ∼ log z − −
2z 2kz 2k
k=1

This series also exhibits the phenomenon that, while not converge absolutely for any z, its partial sums are
increasingly accurate. We have the following third order expansion of ψ(z) given by
2 ∞ t3
Z
1 1 1 1
ψ(z) = log z − − + dt = log z − − + O(z −4 )
2z 12z 2 z 2 0 (t2 + z 2 )(e2πt − 1) 2z 12z 2
as z → ∞. We now state and prove the main theorem of this section.
Theorem 9.6 (Comparing Wishart and Inverse Wishart). Let n ≥ d + 1 and m ≥ d be positive integers
1
such that n = Θ(m), |m − n| = o(n) and n − d = Ω(n) as m, n, d → ∞, and let β = m(n−d−1) . Then
  d3 s2 d(d + 1) 5sd3 sd3
dKL Wd (n, Id ) Wd−1 (m, β · Id ) = + − +

6n 8n2 24n2 12mn 
+ O d n |s| + d4 n−2 + d2 n−1
2 −3 3

where s = n − m. In particular, when m = n and n  d3 it follows that


 
dKL Wd (n, Id ) Wd−1 (n, β · Id ) = o(1)

65
Proof. Note that the given conditions also imply that m − d = Ω(m). Let X ∼ Wd (n, Id ) and Y ∼
Wd−1 (m, β · Id ). Throughout this section, A ∈ Rd×d will denote a positive semidefinite matrix. It is
well known that the Wishart distribution Wd (n, Id ) is absolutely continuous with respect to the Lebesgue
measure on the cone CdPSD of positive semidefinite matrices in Rd×d [Wis28]. Furthermore the density of X
with respect to the Lebesgue measure can be written as
 
1 (n−d−1)/2 1
fX (A) = nd/2  · |A| · exp − Tr(A)
2 · Γd n2 2

A change of variables from A → β −1 · A−1 shows that the distribution Wd−1 (m, β · Id ) is also absolutely
continuous with respect to the Lebesgue measure on CdPSD . It is well-known (see e.g. [GCS+ 13]) that the
density of Y can be written as

β −md/2 β −1
 
−(m+d+1)/2 −1

fY (A) = md/2 · |A| · exp − · Tr A
· Γd m

2 2
2

Now note that


(m − n)d m  n  md
log fX (A) − log fY (A) = · log 2 + log Γd − log Γd + · log β
2 2 2 2
m+n 1 β −1
· Tr A−1

+ · log |A| − Tr(A) +
2 2 2
The expectation of log |A| where A ∼ Wd (n, Id ) is well known (e.g. see pg. 693 of [Bis06]) to be equal to
n
EA∼Wd (n,Id ) [log |A|] = ψd + d log 2
2
Furthermore, it is well known (e.g. see pg. 85 [MKB79]) that the mean of A−1 if A ∼ Wd (n, Id ) is

Id
EA∼Wd (n,Id ) A−1 =
 
n−d−1

Therefore we have that EA∼Wd (n,Id ) Tr A−1 = d/(n−d−1). Similarly, we have that EA∼Wd (n,Id ) [A] =
 

n · Id and thus EA∼Wd (n,Id ) [Tr(A)] = nd. Combining these identities yields that
 
dKL Wd (n, Id ) Wd−1 (m, β · Id ) = EA∼Wd (n,Id ) [log fX (A) − log fY (A)]

(m − n)d m  n  md
= · log 2 + log Γd − log Γd + · log β
2 2 2 2
m + n  n  nd β −1 d
+ · ψd + d log 2 − + (7)
2 2 2 2(n − d − 1)

We now use the series approximations for Γ(z) and ψ(z) mentioned above to approximate each of these
terms. Note that since m − d = Ω(m), we have that
d  
m d(d − 1) X m−i+1
log Γd = log π + log Γ
2 4 2
i=1
d     
d(d − 1) X 1 m−i m−i+1
= log π + log(2π) + log
4 2 2 2
i=1

66
  
m−i+1 1
− + + O(m−3 )
2 6(m − i + 1)
d(d − 1) d dm d(d − 1)
= log π + log(2π) − + + O(dm−3 )
4 2 2 4
d   m m − i   
X m−i i−1 1
+ log + log 1 − +
2 2 2 m 6(m − i + 1)
i=1

using the fact that di=1 (i − 1) = d(d − 1)/2. Let Hn denote the harmonic series Hn = ni=1 1/i. Using
P P
the well-known fact that ψ(n + 1) = Hn − γ where γ is the Euler-Mascheroni constant, we have that
d
X 1
= Hm − Hm−d
m−i+1
i=1
= log(m + 1) − log(m − d + 1) + O(m−1 )
d d2
= + + O(d3 m−3 ) + O(m−1 )
m + 1 2(m + 1)2
= O dm−1


where the second last estimate follows applying the Taylor approximation log(1 − x) = −x − 21 x2 + O(x3 )
d
for x = m+1 ∈ (0, 1). Applying this Taylor approximation again, we have that

d    
X m−i i−1
log 1 −
2 m
i=1
d 
1 X (m − i)(i − 1) (m − i)(i − 1)2

3 −2

=− + +O i m
2 m 2m2
i=1
d 
1 X (m − 1)(i − 1) (i − 1)2 (m − 1)(i − 1)2 (i − 1)3

= O(d4 m−2 ) − − + −
2 m m 2m2 2m2
i=1
(m − 1)d(d − 1) d(d − 1)(2d − 1) (m − 1)d(d − 1)(2d − 1) d2 (d − 1)2
= O(d4 m−2 ) − + − +
4m 12m 24m2 16m2
d(d − 1) d(d − 1)(2d + 5)
= O(d4 m−2 ) − +
4 24m

using the identities i=1 (i − 1)2 = d(d − 1)(2d − 1)/6 and di=1 (i − 1)3 = d2 (d − 1)2 /4. Combining all
Pd P
of these approximations and simplifying using the fact that m − d = Ω(m) yields that
m d(d − 1) d dm dm  m  d(d + 1) m
log Γd = log π + log(2π) − + log − log
2 4 2 2 2 2 4 2
d(d − 1)(2d + 5)
+ O d4 m−2 + dm−1

+
24m
as m, d → ∞ and m−d = Ω(m). An analogous estimate is also true for log Γd n2 . Similar approximations


now yield since n − d = Ω(n), we have that


d    
n X n−i+1 1
ψd = log − + O(n−2 )
2 2 n−i+1
i=1

67
d  
n X i−1
= d log + log 1 − − Hn + Hn−d + O(dn−2 )
2 n
i=1
d 
i − 1 (i − 1)2 d2
n X 
3 −3
 d
= d log − + + O i n − −
2 n 2n2 n + 1 2(n + 1)2
i=1
+ O d3 n−3 + dn−2

 n  d(d − 1) d(d − 1)(2d − 1) d
+ O d4 n−3 + d2 n−2

= d log − − 2

2 2n 12n n+1
Here we have expanded ψ(n + 1) = Hn − γ to an additional order with the approximation
1 1
Hn − Hn−d = log(n + 1) − log(n − d + 1) − + + O(n−2 )
2(n + 1) 2(n − d + 1)
d d2
= + + O(dn−2 )
n + 1 2(n + 1)2

Combining all of these estimates and simplifying with β −1 = m(n − d − 1) now yields that
 
dKL Wd (n, Id ) Wd−1 (m, β · Id )

md nd β −1 d m n m + n n


= md log 2 + log β − + + log Γd − log Γd + · ψd
2 2 2(n − d − 1) 2 2 2 2
md nd −1
β d d(m − n) dm  m  dn  n 
= md log 2 + log β − + − + log − log
2 2 2(n − d − 1) 2 2 2 2 2
d(d + 1)  m  d(d − 1)(2d + 5) (m + n)d  n 
− log + · (m−1 − n−1 ) + log
4 n 24 2 2
(m + n)d(d − 1) (m + n)d(d − 1)(2d − 1) (m + n)d 4 −2 2 −1

− − − + O d n + d n
4n 24n2 2(n + 1)
 
(m + n)d(d − 1) (m + n)d d(d + 1)  m  dm d+1
=− − − log − log 1 −
4n 2(n + 1) 4 n 2 n
(m + n)d(d − 1)(2d − 1) (n − m)d(d − 1)(2d + 5) 4 −2 2 −1

− + + O d n + d n
24n2 24mn
(m + n)d(d + 1) d(d + 1) n − m (n − m)2
 
−3 3

=− + + + O n |s|
4n 4 n 2n2
2
 
dm d + 1 (d + 1) (m + n)d(d − 1)(2d − 1)
+ + 2
+ O(d3 n−3 ) −
2 n 2n 24n2
sd(d − 1)(2d + 5)
+ O d4 n−2 + d2 n−1

+
24mn
d 3 s d(d + 1) 5sd3
2 sd3 2 −3 3 4 −2 2 −1

= + − + + O d n |s| + d n + d n
6n 8n2 24n2 12mn
In the fourth equality, we used the fact that 1/(n+1) = 1/n+O(n−2 ), that s = n−m = o(n) and the Taylor
approximation log(1 − x) = −x − 21 x2 + O(x3 ) for |x| < 1. The last line follows from absorbing small
terms into the error term. The second part of the theorem statement follows immediately from substituting
m = n and s = 0 into the bound above and noting that the dominant term is d3 /6n when n  d3 .

We now make two remarks on the theorem above. The first motivates the choice of the parameter β to

68
satisfy β −1 = m(n − d − 1). Note that the KL divergence in Equation (7) depends on β through the terms

md β −1 d
log β +
2 2(n − d − 1)

which is minimized at the stationary point β −1 = m(n − d − 1). Thus the KL divergence in Equation (7)
is minimized for a fixed pair (m, n) at this value of β. We also remark that the distributions Wd (n, Id ) and
Wd−1 (m, β · Id ) only converge in KL divergence if d  n3 as the expression in Theorem 9.6 is easily seen
to not converge to zero if d = O(n3 ).

10 Negative Correlations, Sparse Mixtures and Supervised Problems


In the first part of this section, we introduce and give a reduction to the intermediate problem imbalanced
sparse Gaussian mixtures, as outlined in Section 4.3 and the beginning of Section 8. This reduction is then
used in the second part of this section, along with the reduction to negative sparse PCA in the previous
section, as a subroutine in a reduction to robust sparse linear regression and mixtures of sparse linear regres-
sions, as outlined in Section 4.4. Our reduction to imbalanced sparse Gaussian mixtures will also be used in
Section 13 to show computational lower bounds for robust sparse mean estimation.

10.1 Reduction to Imbalanced Sparse Gaussian Mixtures


In this section, we give our reduction from k- BPDS to the intermediate problem ISGM, which we will reduce
from in subsequent sections to obtain several of our main computational lower bounds. We present our re-
duction to ISGM with dense Bernoulli rotations applied with the design matrix Kr,t from Definition 8.3, and
at the end of this section sketch the variant using the random design matrix alternative Rn, introduced in
Section 8.4. Throughout this section, the input k- BPDS instance will be described by its m×n adjacency ma-
trix of Bernoulli random variables. The problem ISGM, imbalanced sparse Gaussian mixtures, is a simple vs.
simple hypothesis testing problem defined formally below. A similar distribution was also used in [DKS17]
to construct an instance of robust sparse mean estimation inducing the tight statistical-computational gap in
the statistical query model.

Definition 10.1 (Imbalanced Sparse Gaussian Mixtures). Given some µ ∈ R and  ∈ (0, 1), let µ0 be such
that  · µ0 + (1 − ) · µ = 0. For each subset S ⊆ [d], ISGMD (n, S, d, µ, ) denotes the distribution over
X = (X1 , X2 , . . . , Xn ) where Xi ∈ Rd where

X1 , X2 , . . . , Xn ∼i.i.d. MIX N (µ · 1S , Id ), N (µ0 · 1S , Id )




We will use the notation ISGM(n, k, d, µ, ) to refer to the hypothesis testing problem between H0 :
X1 , X2 , . . . , Xn ∼i.i.d. N (0, Id ) and an alternative hypothesis H1 sampling the distribution above where
S is chosen uniformly at random among all k-subsets of [d]. Our reduction k- BPDS - TO - ISGM is shown in
Figure 9. The next theorem encapsulates the total variation guarantees of this reduction. A key parameter is
the prime number r, which is used to parameterize the design matrices Kr,t in the B ERN -ROTATIONS step.
To show the tightest possible statistical-computational gaps in applications of this theorem, we ideally
would want to take n such that n = Θ(kn rt ). When r is growing with N , this induces number theoretic
constraints on our choices of parameters that require careful attention and will be discussed in Section
13.1. Because of this subtlety, we have kept the statement of our next theorem technically precise and
in terms of all of the free parameters of the reduction k- BPDS - TO - ISGM. Ignoring these number theoretic
constraints, the reduction k- BPDS - TO - ISGM can be interpreted as essentially mapping an instance of k- BPDS
√ √
with parameters (m, n, km , kn , p, q) with kn = o( n), km = o( m) and planted row indices S where

69
Algorithm k- BPDS - TO - ISGM
Inputs: Matrix M ∈ {0, 1}m×n , dense subgraph dimensions km and kn where kn divides n and the
following parameters

• partition F of [n] into kn parts of size n/kn , edge probabilities 0 < q < p ≤ 1 and a slow growing
function w(n) = ω(1)

• target ISGM parameters (N, d, µ, ) satisfying that  = 1/r for some prime number r,
c
wN ≤ kn r`, m ≤ d, n ≤ kn rt ≤ poly(n) and µ≤ p
r (r − 1) log(kn mrt )
t

rt −1
for some t ∈ N, a sufficiently small constant c > 0 and where ` = r−1

t
1. Pad: Form MPD ∈ {0, 1}m×kn r by adding kn rt − n new columns sampled i.i.d. from Bern(q)⊗m
to the right end of M . Let F 0 be the partition formed by letting Fi0 be Fi with exactly rt − n/kn
of the new columns.

2. Bernoulli Rotations: Fix a partition [kn r`] = F100 ∪ F200 ∪ · · · ∪ Fk00n into kn parts each of size r`
and compute the matrix MR ∈ Rm×kn r` as follows:

(1) For each row i and part Fj0 , apply B ERN -ROTATIONS to the vector (MPD )i,Fj0 of entries
in row i and in columns from Fj0 with matrix parameter Kr,t , rejection kernel parameter
t
p
RRK = kn mrp , Bernoulli probabilities 0 < q < p ≤ 1, λ = 1 + (r − 1)−1 , mean
parameter λ rt (r − 1) · µ and output dimension r`.
(2) Set the entries of (MR )i,Fj00 to be the entries in order of the vector output in (1).

3. Permute and Output: Form X ∈ Rd×N by choosing N distinct columns of MR uniformly at


random, embedding the resulting matrix as the first m rows of X and sampling the remaining
d − m rows of X i.i.d. from N (0, IN ). Output the columns (X1 , X2 , . . . , XN ) of X.

Figure 9: Reduction from bipartite k-partite planted dense subgraph to exactly imbalanced sparse Gaussian mixtures.

|S| = km to the instance ISGMD (N, S, d, µ, ) where  ∈ (0, 1) is arbitrary and can vary with n. The target
parameters N, d and µ satisfy that
r
1 kn
d = Ω(m), N = o(n) and µ  √ ·
log n n
All of our applications will handle the number theoretic constraints to set parameters so that they nearly
satisfy these conditions. The slow-growing function w(n) is so that Step 3 subsamples the produced samples
by a large enough factor to enable an application of finite de Finetti’s theorem.
We now state our total variation guarantees for k- BPDS - TO - ISGM. Given a partition F of [n] with
[n] = F1 ∪ F2 ∪ · · · ∪ Fkn , let Un (F ) denote the distribution of kn -subsets of [n] formed by choosing one
member element of each of F1 , F2 , . . . , Fkn uniformly at random. Let Un,kn denote the uniform distribution
on kn -subsets of [n].

70
Theorem 10.2 (Reduction from k- BPDS to ISGM). Let n be a parameter, r = r(n) ≥ 2 be a prime number
and w(n) = ω(1) be a slow-growing function. Fix initial and target parameters as follows:

• Initial k- BPDS Parameters: vertex counts on each side m and n that are polynomial in one another,
dense subgraph dimensions km and kn where kn divides n, edge probabilities 0 < q < p ≤ 1 with
min{q, 1 − q} = Ω(1) and p − q ≥ (mn)−O(1) , and a partition F of [n].

• Target ISGM Parameters: (N, d, µ, ) where  = 1/r and there is a parameter t = t(N ) ∈ N with

kn r(rt − 1)
wN ≤ , m ≤ d ≤ poly(n), n ≤ kn rt ≤ poly(n) and
r−1
δ 1
0≤µ≤ p ·p
2 6 log(kn mrt ) + 2 log(p − q)−1 rt (r − 1)(1 + (r − 1)−1 )
n    o
where δ = min log pq , log 1−p
1−q
.

Let A(G) denote k- BPDS - TO - ISGM applied with the parameters above to a bipartite graph G with m left
vertices and n right vertices. Then A runs in poly(m, n) time and it follows that

dTV A M[m]×[n] (S × T, p, q) , ISGMD (N, S, d, µ, ) = O w−1 + kn−2 m−2 r−2t


  

dTV A Bern(q)⊗m×n , N (0, Id )⊗N = O kn−2 m−2 r−2t


  

for all subsets S ⊆ [m] with |S| = km and subsets T ⊆ [n] with |T | = kn and |T ∩ Fi | = 1 for each
1 ≤ i ≤ kn .

In the rest of this section, let A denote the reduction k- BPDS - TO - ISGM with input (M, F ) where F is a
partition of [n] and output (X1 , X2 , . . . , XN ). Let Hyp(N, K, n) denote a hypergeometric distribution with
n draws from a population of size N with K success states. We will also need the upper bound on the total
variation between hypergeometric and binomial distributions given by
4n
dTV (Hyp(N, K, n), Bin(n, K/N )) ≤
N
This bound is a simple case of finite de Finetti’s theorem and is proven in Theorem (4) in [DF80]. We now
proceed to establish the total variation guarantees for Bernoulli rotations and subsampling as in Steps 2 and
3 of A in the next two lemmas.
Before proceeding to prove these lemmas, we make a definition that will be used in the next few sections.
Suppose that M is a b × a matrix, F and F 0 are partitions of [ka] and [kb] into k equally sized parts and
S ⊆ [kb] is such that |S ∩ Fi | = 1 for each 1 ≤ i ≤ k. Then define the vector v = vS,F,F 0 (M ) ∈ Rkb to be
such that the restriction vFi0 to the elements of Fi0 is given by

vFi0 = M·,σF (j) where j is the unique element in S ∩ Fi


i

Here, M·,j denotes the jth column of M and σFi denotes the order preserving bijection from Fi to [b]. In
other words, vS,F,F 0 is the vector formed by concatenating the columns of M along the partition F 0 , where
the elements S ∩ Fi select which column appears along each part Fi0 . In this section, whenever S ∩ Fi has
size one, we will abuse notation and also use S ∩ Fi to denote its unique element.

Lemma 10.3 (Bernoulli Rotations for ISGM). Let F 0 and F 00 be a fixed partitions of [kn rt ] and [kn r`] into
kn parts of size rt and r`, respectively, and let S ⊆ [m] be a fixed km -subset. Let T ⊆ [kn rt ] where

71
|T ∩ Fi0 | = 1 for each 1 ≤ i ≤ kn . Let A2 denote Step 2 of k- BPDS - TO - ISGM with input MPD and output
MR . Suppose that p, q and µ are as in Theorem 10.2, then it follows that
  p 
dTV A2 M[m]×[kn rt ] (S × T, Bern(p), Bern(q)) , L µ rt (r − 1) · 1S vT,F 0 ,F 00 (Kr,t )> + N (0, 1)⊗m×kn r`


= O kn−2 m−2 r−2t



  t
 
dTV A2 Bern(q)⊗m×kn r , N (0, 1)⊗m×kn r` = O kn−2 m−2 r−2t


Proof. First consider the case where MPD ∼ M[m]×[kn rt ] (S × T, Bern(p), Bern(q)). Observe that the
subvectors of MPD are distributed as
(  
PB Fj0 , T ∩ Fj0 , p, q if i ∈ S
(MPD )i,Fj0 ∼ t
Bern(q)⊗r otherwise

and are independent. Combining upper bound on the singular values of Kr,t in Lemma 8.5, Lemma 8.1
applied with RRK = kn mrt and the condition on µ in the statement of Theorem 10.2 implies that
  p 
= O kn−3 m−3 r−2t

00 t
dTV (MR )i,Fj , N µ r (r − 1) · (Kr,t )·,T ∩Fj , Ir`
0 if i ∈ S
 
dTV (MR )i,Fj00 , N (0, Ir` ) = O kn−3 m−3 r−2t

otherwise

Now observe that the subvectors (MR )i,Fj00 are also independent. Therefore the tensorization property of
total variation in Fact 6.2 implies that dTV (MR , L(Z)) = O kn−2 m−2 r−2t where Z is defined so that its


subvectors Zi,Fj00 are independent and distributed as


(  p 
N µ rt (r − 1) · (Kr,t )·,T ∩Fj0 , Ir` if i ∈ S
Zi,Fj00 ∼
N (0, Ir` ) otherwise

Note that the entries of Z are p


independent Gaussians each with variance 1. Furthermore, the mean of Z
can be verified to be exactly µ rt (r − 1) · 1S vT,F 0 ,F 00 (Kr,t )> . This completes the proof of the first total
variation upper bound in the statement of the lemma. The second bound follows from the same argument
above applied with S = ∅.

Lemma 10.4 (Subsampling for ISGM). Let F 0 , F 00 , S and T be as in Lemma 10.3. Let A3 denote Step 3 of
k- PDS - TO - ISGM with input MR and output (X1 , X2 , . . . , XN ). Then
   
dTV A3 τ · 1S vT,F 0 ,F 00 (Kr,t )> + N (0, 1)⊗m×kn r` , ISGMD (N, S, d, µ, ) ≤ 4w−1

where  = 1/r and µ = √ τ


. Furthermore, it holds that A3 N (0, 1)⊗m×kn r` ∼ N (0, Id )⊗N .

rt (r−1)

Proof. Suppose that MR ∼ τ · 1S KT,F > ⊗m×kn r` . For fixed S, T, F 0 and F 00 , the entries of
0 ,F 00 + N (0, 1)

MR are independent. Observe that the columns of MR are p independent and either distributed according
N (µ · 1S , Im ) or N (µ0 · 1S , Im ) where µ0 = τ (1 − r)/ rp t (r − 1) depending on whether the entry of
p
vT,F 0 ,F 00 (Kr,t ) at the index corresponding to the column is 1/ rt (r − 1) or (1 − r)/ rt (r − p 1).
By Lemma 8.4, it follows that each column of Kr,t contains exactly ` entries p equal to (1−r)/ rt (r − 1).
t
This implies that exactly kn (r − 1)` entries of vT,F 0 ,F 00 (Kr,t ) are equal to 1/ r (r − 1). Define RN (s) to
be the distribution on RN with a sample vp ∼ RN (s) generated by first choosingp an s-subset U of [N ] uni-
formly at random and then setting vi = 1/ r (r − 1) if i ∈ U and vi = (1 − r)/ rt (r − 1) if i 6∈ U . Note
t

72
that the number of columns distributed as N (µ · 1S , Im ) in MR chosen to be in X is distributed according
to Hyp(kn r`, kn (r − 1)`, N ). Step 3 of A therefore ensures that, if MR is distributed as above, then
 
X ∼ L τ · 1S RN (Hyp(kn `, kn (r − 1)`, N ))> + N (0, 1)⊗d×N

Observe that the data matrix for a sample from ISGMD (N, S, d, µ, ) can be expressed similarly as
 
> ⊗d×N
ISGM D (N, S, d, µ, ) = L τ · 1S Rn (Bin(N, 1 − )) + N (0, 1)
p
where again we set µ = τ / rt (r − 1). The conditioning property of dTV in Fact 6.2 now implies that
4N
dTV (L(X), ISGMD (N, S, d, µ, )) ≤ dTV (Bin(N, 1 − ), Hyp (kn r`, kn (r − 1)`, N )) ≤ ≤ 4w−1
kn r`
The last inequality follows from the application of Theorem (4) in [DF80] to hypergeometric distributions
above along with the fact that 1 −  = (kn (r − 1)`)/kn r` and wN ≤ kn r`. This completes the proof of
 statement. Now consider applying the above argument with τ = 0. It follows
the upper bound in the lemma
that A3 N (0, 1)⊗m×kn r` ∼ N (0, 1)⊗d×N = N (0, Id )⊗N , which completes the proof of the lemma.

We now combine these lemmas to complete the proof of Theorem 10.2.

Proof of Theorem 10.2. We apply Lemma 6.3 to the steps Ai of A under each of H0 and H1 to prove
Theorem 10.2. Define the steps of A to map inputs to outputs as follows
A A A
1
(M, F ) −−→ (MPD , F 0 ) −−→
2
(MR , F 00 ) −→
3
(X1 , X2 , . . . , XN )
We first prove the desired result in the case that H1 holds. Consider Lemma 6.3 applied to the steps Ai
above and the following sequence of distributions
P0 = M[m]×[n] (S × T, Bern(p), Bern(q))
P1 = M[m]×[kn rt ] (S × T, Bern(p), Bern(q))
P2 = µ rt (r − 1) · 1S vT,F 0 ,F 00 (Kr,t )> + N (0, 1)⊗m×kn r`
p

P3 = ISGMD (N, S, d, µ, )
As in the statement of Lemma 6.3, let i be any real numbers satisfying dTV (Ai (Pi−1 ), Pi ) ≤ i for each
step i. By construction, the step A1 is exact and we can take 1 = 0. Lemma 10.3 yields that we can take
2 = O kn−2 m−2 r−2t . Applying Lemma 10.4 yields that we can take 3 = 4w−1 . By Lemma 6.3, we
therefore have that
dTV A M[m]×[n] (S × T, p, q) , ISGMD (N, S, d, µ, ) = O w−1 + kn−2 m−2 r−2t
  

which proves the desired result in the case of H1 . Now consider the case that H0 holds and Lemma 6.3
applied to the steps Ai and the following sequence of distributions
t
P0 = Bern(Q)⊗m×n , P1 = Bern(Q)⊗m×kn r , P2 = N (0, 1)⊗m×kn r` and P3 = N (0, Id )⊗N
As above, Lemmas 10.3 and 10.4 imply that we can take
2 = O kn−2 m−2 r−2t

1 = 0, and 3 = 0
By Lemma 6.3, we therefore have that
dTV A Bern(q)⊗m×n , N (0, Id )⊗N = O kn−2 m−2 r−2t
  

which completes the proof of the theorem.

73
As discussed in Section 8.4, we can replace Kr,t in k- BPDS - TO - ISGM with the random matrix alternative
RL, . More precisely, let k- BPDS - TO - ISGMR denote the reduction in Figure 9 with the following changes:

• At the beginning of the reduction, rejection sample RL, for at most Θ((log L)2 ) iterations until the
criteria of Lemma 8.13 are met, as outlined in Section 8.4. Let A ∈ RL×L be the resulting matrix or
stop the reduction if no such matrix is found. The latter case contributes L−ω(1) to each of the total
variation errors in Corollary 10.5.

• The dimensions r` and rt of the matrix Kr,t used in B ERN -ROTATIONS in Step 2 are both replaced
throughout the reduction by the parameter L. This changes the output dimensions of MPD and MR in
Steps 1 and 2 to both be m × kn L.

• In Step 2, apply B ERN -ROTATIONS with A instead of Kr,t and let λ = C where C is the constant in
Lemma 8.13.

The reduction k- BPDS - TO - ISGMR eliminates a number-theoretic constraint in k- BPDS - TO - ISGM arising
from the fact the intermediate matrix MR has a dimension that must be of the form kn rt for some integer
t. In contrast, k- BPDS - TO - ISGMR only requires that this dimension of MR be a multiple of kn . This will
remove the condition (T) from our computational lower bounds for RSME, which is only restrictive in the
very small  regime of  = n−Ω(1) . We will deduce this computational lower bound for RSME implied by
the reduction k- BPDS - TO - ISGMR formally in Section 13.1.
The reduction k- BPDS - TO - ISGMR can be analyzed using an argument identical to the one above, with
Lemma 8.13 used in place of Lemma 8.5 and accounting for the additional L−ω(1) total variation error
incurred by failing to obtain a Rn, satisfying the criteria in Lemma 8.13. Carrying this out yields the
following corollary. We remark that the new condition   L−1 log L in the corollary below will amount
to the condition   N −1/2 log N in our computational lower bounds. This is because, in our applications,


we will typically set N = Θ̃(kn L) and kn to be very close to but slightly smaller than n = Θ̃( N ), to
ensure that the input k- BPDS instance is hard. These conditions together with   L−1 log L amount to the
condition on the target parameters given by   N −1/2 log N .

Corollary 10.5 (Reduction from k- BPDS to ISGM with RL, ). Let n be a parameter and let w(n) = ω(1)
be a slow-growing function. Fix initial and target parameters as follows:

• Initial k- BPDS Parameters: m, n, km , kn , p, q and F as in Theorem 10.2.

• Target ISGM Parameters: (N, d, µ, ) such that there is a parameter L = L(N ) ∈ N such that
L(N ) → ∞ and it holds that

w log L 1
max{wN, n} ≤ kn L ≤ poly(n), m ≤ d ≤ poly(n), ≤≤ and
L 2
r
Cδ 
0≤µ≤ p ·
log(kn mL) + log(p − q)−1 L
for some sufficiently small constant C > 0, where δ is as in Theorem 10.2.

If A denotes k- BPDS - TO - ISGMR applied with the parameters above, then A runs in poly(m, n) time and
 
dTV A M[m]×[n] (S × T, p, q) , ISGMD (N, S, d, µ, ) = o(1)
dTV A Bern(q)⊗m×n , N (0, Id )⊗N = o(1)
 

for all km -subsets S ⊆ [m] and kn -subsets T ⊆ [n] with |T ∩ Fi | = 1 for each 1 ≤ i ≤ kn .

74
Algorithm k- BPDS - TO - MSLR
Inputs: Matrix M ∈ {0, 1}m×n , dense subgraph dimensions km and kn where kn divides n and the
following parameters

• partition F , edge probabilities 0 < q < p ≤ 1 and w(n) as in Figure 9

• target MSLR parameters (N, d, γ, ) and prime r and t ∈ N where N, d, r, t, ` and  = 1/r are as
in Figure 9 with the additional requirement that N ≤ n and where γ ∈ (0, 1) satisfies that
 
2 km kn km
γ ≤ c · min ,
rt+1 log(kn mrt ) log N n log(mn)

for a sufficiently small constant c > 0.

1. Clone: Compute the matrices MISGM ∈ {0, 1}m×n and MNEG - SPCA ∈ {0, 1}m×n by applying
B ERNOULLI -C LONE with t = 2 copies to the entries of pthe matrix M with input Bernoulli proba-
√ 
bilities p and q, and output probabilities p and Q = 1 − (1 − p)(1 − q) + 1{p=1} q−1 .

2. Produce ISGM Instance: Form (Z1 , Z2 , . . . , ZN ) where Zi ∈ Rd as the output of


k- BPDS - TO - ISGM applied to the matrix MISGM with partition F , edge probabilities 0 < Q <
p ≤ 1, slow-growing function w, target ISGM parameters (N, d, µ, ) and µ > 0 given by
r
log N
µ = 4γ ·
km

3. Produce NEG - SPCA Instance: Form (W1 , W2 , . . . , Wn ) where Wi ∈ Rd as the output of


BPDS - TO - NEG - SPCA applied to the matrix MNEG - SPCA with edge probabilities 0 < Q < p ≤ 1,
target dimension d and parameter τ > 0 satisfying that

8nγ 2
τ2 =
kn km (1 − γ 2 )

4. Scale and Label ISGM p Instance: Generate y1 , y2 , . . . , yN ∼i.i.d. N (0, 1 + γ 2 ) and truncate each
yi to satisfy |yi | ≤ 2 (1 + γ 2 ) log N . Generate G1 , G2 , . . . , GN ∼i.i.d. N (0, Id ) and form
(Z10 , Z20 , . . . , ZN
0 ) where Z 0 ∈ Rd as
i
s
yi2
r
y i 2
Zi0 = · Z i + 1 − · Gi
4(1 + γ 2 ) log N 4(1 + γ 2 )2 log N

5. Merge and Output: For each 1 ≤ i ≤ N , let Xi = √1


2
(Zi0 + Wi ) and output the N labelled pairs
(X1 , y1 ), (X2 , y2 ), . . . , (XN , yN ).

Figure 10: Reduction from bipartite planted dense subgraph to mixtures of sparse linear regressions through imbal-
anced Gaussian mixtures and negative sparse PCA

75
10.2 Sparse Mixtures of Regressions and Negative Sparse PCA
In this section, we combine the previous two reductions to NEG - SPCA and ISGM with some additional
observations to produce a single reduction that will be used to prove two of our main results in Section 13.3
– computational lower bounds for mixtures of SLRs and robust SLR. We begin this section by generalizing
our definition of the distribution MSLRD (n, S, d, γ, 1/2) from Section 6.3 to simultaneously capture the
mixtures of SLRs distributions we will reduce to and our adversarial construction for robust SLR.
Recall from Section 6.3 that LRd (v) denotes the distribution of a single sample-label pair (X, y) ∈
d
R × R given by y = hv, Xi + η where X ∼ N (0, Id ) and η ∼ N (0, 1). Our generalization of MSLRD will
be parameterized by  ∈ (0, 1). The canonical setup for mixtures of SLRs from Section 6.3 corresponds to
setting  = 1/2 and formally is restated in the following definition for convenience.

Definition 10.6 (Mixtures of Sparse Linear Regressions with  = 1/2). Let γ ∈ R be such that γ > 0.
For each subset S ⊆ [d], let MSLRD (n, S, d, γ, 1/2) denote the distribution over n-tuples of independent
data-label pairs (X1 , y1 ), (X2 , y2 ), . . . , (Xn , yn ) where Xi ∈ Rd and yi ∈ R are sampled as follows:

• first sample n independent Rademacher random variables s1 , s2 , . . . , sn ∼i.i.d. Rad; and

• then form data-label pairs (Xi , yi ) ∼ LRd (γsi vS ) for each 1 ≤ i ≤ n.

where vS ∈ Rd is the |S|-sparse unit vector vS = |S|−1/2 · 1S .

Our more general formulation when  < 1/2 is described in the definition below. When  < 1/2, the
distribution MSLRD (n, S, d, γ, ) can always be produced by an adversary in robust SLR. This observation
will be discussed in more detail and used in Section 13.3 to show computational lower bounds for robust
SLR. The reason we have chosen to write these two different distributions under a common notation is that
the main reduction of this section, k- BPDS - TO - MSLR, will simultaneously map to both mixtures of SLRs
and robust SLR. Lower bounds for the mixture problem will be obtained by setting r = 2 in the reduction
to ISGM used as a subroutine in k- BPDS - TO - MSLR, while lower bounds for robust sparse regression will be
obtained by taking r > 2. These implications of k- BPDS - TO - MSLR are discussed further in Section 13.

Definition 10.7 (Mixtures of Sparse Linear Regressions with  < 1/2). Let γ > 0,  ∈ (0, 1/2) and let
a denote a = −1 (1 − ). For each subset S ⊆ [d], let MSLRD (n, S, d, γ, ) denote the distribution over
n-tuples of data-label pairs (X1 , y1 ), (X2 , y2 ), . . . , (Xn , yn ) sampled as follows:

• the pairs (b1 , X1 , y1 ), (b2 , X2 , y2 ), . . . , (bn , Xn , yn ) are i.i.d. and b1 , b2 , . . . , bn ∼i.i.d. Bern(1 − );

• if bi = 1, then (Xi , yu ) ∼ LRd (γvS ) where vS is as in Definition 10.6; and

• if bi = 0, then (Xi , yi ) is jointly Gaussian with mean zero and (d + 1) × (d + 1) covariance matrix
 " 2 −1)γ 2
#
Id + (a1+γ > −aγ · v

ΣXX ΣXy 2 · v v
S S S
=
ΣyX Σyy −aγ · vS> 1 + γ2

The main reduction of this section from k- BPDS to MSLR is shown in Figure 10. This reduction inherits
the number theoretic constraints of our reduction to ISGM mentioned in the previous section. These will be
discussed in more detail when k- BPDS - TO - MSLR is used to deduce computational lower bounds in Section
13.3. The following theorem gives the total variation guarantees for k- BPDS - TO - MSLR.

Theorem 10.8 (Reduction from k- BPDS to MSLR). Let n be a parameter, r = r(n) ≥ 2 be a prime number
and w(n) = ω(1) be a slow-growing function. Fix initial and target parameters as follows:

76
• Initial k- BPDS Parameters: vertex counts on each side m and n that are polynomial in one another
and satisfy the condition that n  m3 , subgraph dimensions km and kn where kn divides n, constant
densities 0 < q < p ≤ 1 and a partition F of [n].

• Target MSLR Parameters: (N, d, γ, ) where  = 1/r and there is a parameter t = t(N ) ∈ N with

kn r(rt − 1)
N ≤ n, wN ≤ , m ≤ d ≤ poly(n), and n ≤ kn rt ≤ poly(n)
r−1
and where γ ∈ (0, 1/2) satisfies that
 
2 km kn km
γ ≤ c · min t+1 t
,
r log(kn mr ) log N n log(mn)

for a sufficiently small constant c > 0.

Let A(G) denote k- BPDS - TO - MSLR applied with the parameters above to a bipartite graph G with m left
vertices and n right vertices. Then A runs in poly(m, n) time and it follows that
 
dTV A M[m]×[n] (S × T, p, q) , MSLRD (N, S, d, γ, ) = O w−1 + kn−2 m−2 r−2t + m3/2 n−1/2
 
 
+ O kn (4e−3 )n/2kn + N −1
 ⊗N   
dTV A Bern(q)⊗m×n , N (0, Id ) ⊗ N 0, 1 + γ 2 = O kn−2 m−2 r−2t + m3/2 n−1/2


for all subsets S ⊆ [m] with |S| = km and subsets T ⊆ [n] with |T | = kn and |T ∩ Fi | = 1 for each
1 ≤ i ≤ kn .

The proof of this theorem will be broken into several lemmas for clarity. The following four lemmas
analyze the approximate Markov transition properties of Steps 4 and 5 of k- BPDS - TO - MSLR. The first three
lemmas establishes a total variation upper bound in the single sample case. The fourth lemma is a simple
consequence of the first two and establishes the Markov transition properties for Steps 4 and 5 together.

Lemma 10.9 (Planted Single Sample Labelling). Let N be a parameter, γ, µ0 ∈ (0, 1), C > 0 be a constant
and u ∈ Rd be such that kuk2 = 1 and 4C 2 γ 2 ≤ (µ0 )2 / log N . Define the random variables (X, y) and
(X 0 , y 0 ) where X, X 0 ∈ Rd and y, y 0 ∈ R as follows:

• Let X ∼ N (0, Id ) and η ∼ N (0, 1) be independent, and define

y = γ · hu, Xi + η

• Let y 0 be a sample from N (0, 1 + γ 2 ) truncated 0 | ≤ C (1 + γ 2 ) log N , and let Z ∼


p
 to satisfy |y 
2γ 2
N (µ0 · u, Id ), G ∼ N (0, Id ) and W ∼ N 0, Id − 1+γ 2 · uu
> be independent. Now let X 0 be


 s 
0 2
 0
2
1 γ · y γ · y
X0 = √  0 ·Z + 1−2 · G + W (8)
2 µ (1 + γ 2 ) µ0 (1 + γ 2 )

Then it follows that, as N → ∞,


 2

dTV L(X, y), L(X 0 , y 0 ) = O N −C /2


77
Proof. First observe that 4C 2 γ 2 ≤ (µ0 )2 / log N implies that since |y 0 | ≤ C (1 + γ 2 ) log N holds almost
p

surely and γ ∈ (0, 1), it follows that


2
γ · y0

2 ≤ 2(1 + γ 2 )C 2 γ 2 (µ0 )−2 log N ≤ 1
µ0 (1 + γ 2 )
and hence X 0 is well-defined almost surely.
Now note that since y is a linear function of X and η, which are independent Gaussians, it follows that
the d + 1 entries of (X, y) are jointly Gaussian. Since kuk2 = 1, it follows that Var(y) = 1 + γ 2 and
furthermore Cov(y, X) = E[Xy] = γ · u. This implies that the covariance matrix of (X, y) is given by
 
Id γ·u
γ · u> 1 + γ 2

It is well known that X|y is a Gaussian vector with mean and covariance matrix given by

γ2
 
γ·y >
L(X|y) = N · u, Id − · uu
1 + γ2 1 + γ2
Now consider L(X 0 |y 0 ). Let Z = µ0 · u + G0 where G0 ∼ N (0, Id ) and note that
s 2
0 0 γ · y0

0 γ · y γ · y 0 1 1
X = 2
·u+ 0 2
·G + √ · 1−2 0 2
·G+ √ ·W
1+γ µ (1 + γ ) 2 µ (1 + γ ) 2
Note that since y 0 , G0 , G and W are independent, it follows that all of the entries of the second, third and
fourth terms in the expression above are jointly Gaussian conditioned on y 0 . Therefore the entries of X 0 |y 0
are also jointly Gaussian. Furthermore the second, third and fourth terms in the expression above for X 0
have covariance matrices given by
2 2 !
γ · y0 γ · y0 γ2
 
1 1
0 2
· Id , − 0 2
· Id and · Id − · uu>
µ (1 + γ ) 2 µ (1 + γ ) 2 1 + γ2

respectively, conditioned on y 0 . Since these three terms are independent conditioned on y 0 , it follows that
γ2
X 0 |y 0 has covariance matrix Id − 1+γ 2 · uu
> and therefore that

γ2
 
0 0 γ·y >
L(X |y ) = N · u, Id − · uu
1 + γ2 1 + γ2
Rx 2
and is hence identically distributed to L(X|y). Let Φ(x) = √12π −∞ e−x /2 dx be the CDF of N (0, 1). The
conditioning property of total variation in Fact 6.2 therefore implies that

dTV L(X, y), L(X 0 , y 0 ) ≤ dTV L(y), L(y 0 )


 
h p i
= P |y| > c (1 + γ 2 ) log N
  p 
= 2 · 1 − Φ C log N
 2

= O N −C /2

where the first equality holds due to the conditioning on an event property of total variation in Fact 6.2 and
2
the last upper bound follows from the standard estimate 1 − Φ(x) ≤ √12π · x−1 · e−x /2 for x ≥ 1. This
completes the proof of the lemma.

78
The next lemma establishes single sample guarantees that will be needed to analyze the case in which
 < 1/2. The proof of this lemma is very similar to that of Lemma 10.9 and is deferred to Appendix A.4.

Lemma 10.10 (Imbalanced Planted Single Sample Labelling). Let N, γ, µ0 , C and u be as in Lemma 10.9
and let µ00 ∈ (0, 1). Define the random variables (X, y) and (X 0 , y 0 ) as follows:

• Let (X, y) where X ∈ Rd and y ∈ R be jointly Gaussian with mean zero and (d + 1) × (d + 1)
covariance matrix given by
 " 2 −1)γ 2
#
Id + (a1+γ >

ΣXX ΣXy 2 · uu aγ · u
=
ΣyX Σyy aγ · u> 1 + γ2

• Let y 0 , Z, G and W be independent where y 0 , G and W are distributed as in Lemma 10.9 and Z ∼
N (µ00 · u, Id ). Let X 0 be defined by Equation (8) as in Lemma 10.9.

Then it follows that, as N → ∞,


 2

dTV L(X, y), L(X 0 , y 0 ) = O N −C /2


We now state a similar lemma analyzing a single sample in Step 4 of k- BPDS - TO - MSLR in the case
where X and W are not planted. Its proof is also deferred to Appendix A.4.

Lemma 10.11 (Unplanted Single Sample Labelling). Let N, γ, µ0 , C p and u be as in Lemma 10.9. Suppose
that y 0 is a sample from N (0, 1 + γ 2 ) truncated to satisfy |y 0 | ≤ C (1 + γ 2 ) log N and Z, G, W ∼i.i.d.
N (0, Id ) are independent. Let X 0 be defined by Equation (8) as in Lemma 10.9. Then, as N → ∞,
 2

dTV L(X 0 , y 0 ), N (0, Id ) ⊗ N (0, 1 + γ 2 ) = O N −C /2


Combining these three lemmas, we now can analyze Step 4 and Step 5 of A. Let A4-5 (Z, W ) de-
note Steps 4 and 5 of A with inputs Z = (Z1 , Z2 , . . . , ZN ) and W = (W1 , W2 , . . . , Wn ) and output
((X1 , y1 ), (X2 , y2 ), . . . , (XN , yN )). The next lemma applies the previous two lemmas to establish the
Markov transition properties of A4-5 .

Lemma 10.12 (Scaling and Labelling ISGM Instances). Let r, N, d, γ, , m, n, kn , km and S ⊆ [m] where
|S| = km be as in Theorem 10.8 and let µ, γ, θ > 0 be such that
r
log N 8nγ 2 τ 2 kn km
µ = 4γ · , τ2 = and θ =
km kn km (1 − γ 2 ) 4n + τ 2 kn km
⊗n
If Z ∼ ISGM(N, S, d, µ, ) and W ∼ N 0, Id − θvS vS> , then

dTV (A4-5 (Z, W ), MSLRD (N, S, d, γ, )) = O N −1




If Z ∼ N (0, Id )⊗N and W ∼ N (0, 1)⊗d×n , then


 
dTV A4-5 (Z, W ), (N (0, Id ) ⊗ N (0, 1))⊗N = O N −1


Proof. We treat the cases in which  = 1/2 and  < 1/2 as well as the two possible distributions of
(Z, W ) in the lemma statement separately. We first consider the case where  = 1/2 and r = 2 and

79
⊗n
Z ∼ ISGMD (N, S, d, µ, ) and W ∼ N 0, Id − θvS vS> . The Zi are independent and can be generated
by first sampling s1 , s2 , . . . , sN ∼i.i.d. Bern(1/2) and then setting
 √
N (µ km · vS , Id ) if si = 1
Zi ∼ √
N (−µ km · vS , Id ) if si = 0
−1/2 √
where vS = km · 1S . Let µ0 = µ km . It can be verified that the settings of µ, γ and θ above ensure that

2γ 2
r
γ 2 1 2
= · and θ =
µ0 (1 + γ 2 ) 4(1 + γ 2 ) log N 1 + γ2

Let X ∼ N (0, Id ) and η ∼ N (0, 1) be independent. Applying Lemma 10.9 with µ0 = µ km , C = 2,
u = vS and u = −vS , the equalities above and the definition of Xi in Figure 10 now imply that

dTV (L(Xi , yi |si = 1), L (X, γ · hvS , Xi + η)) = O(N −2 )


dTV (L(Xi , yi |si = 0), L (X, −γ · hvS , Xi + η)) = O(N −2 )

for each 1 ≤ i ≤ N . The conditioning property of total variation from Fact 6.2 now implies that if
L1 = L (X, γ · hvS , Xi + η) and L2 = L (X, −γ · hvS , Xi + η), then we have that

dTV L(Xi , yi ), MIX1/2 (L1 , L2 ) = O(N −2 )




For the given distribution on (Z, W ), observe that the pairs (Xi , yi ) for 1 ≤ i ≤ N are independent by
construction in A. Thus the tensorization property of total variation from Fact 6.2 implies that

dTV (L ((X1 , y1 ), (X2 , y2 ), . . . , (XN , yN )) , MSLR(N, S, d, γ, 1/2)) = O(N −1 )

where MSLRD (N, S, d, γ, 1/2) = MIX1/2 (L1 , L2 )⊗N , which establishes the desired bound when  = 1/2
and for the first distribution of (Z, W ).
The other two cases will follow by nearly identical arguments. Consider the case where  is arbitrary

and if Z ∼ N (0, Id )⊗N and W ∼ N (0, 1)⊗d×n , applying Lemma 10.11 with C = 2 and µ0 = µ km
yields that
dTV (L(Xi , yi ), N (0, Id ) ⊗ N (0, 1)) = O(N −2 )
Applying the tensorization property of total variation from Fact 6.2 as above then implies the second bound
in the lemma statement. Finally, consider the case in which  < 1/2, r > 2 and (Z, W ) is still distributed
⊗n
as Z ∼ ISGMD (N, S, d, µ, ) and W ∼ N 0, Id − θvS vS> . If the si are defined as above, then the Zi
are distributed as  √ 
N µ km · vS , Id  if si = 1
Zi ∼ √
N −aµ km · vS , Id if si = 0

where a = −1 (1 − ). Now consider applying Lemma 10.10 with µ0 = µ km , µ00 = aµ0 = µ−1 (1 − ),
C = 2 and u = −vS . This yields that

dTV (L(Xi , yi |si = 0), L(X, y)) = O(N −2 )

where X and y are as in the statement of Lemma 10.10. Combining this with the conditioning property
of total variation from Fact 6.2, the application of Lemma 10.9 in the first case above, the tensorization
property of total variation from Fact 6.2 as in the previous argument and Definition 10.7 yields that

dTV (L ((X1 , y1 ), (X2 , y2 ), . . . , (XN , yN )) , MSLR(N, S, d, γ, )) = O N −1




which completes the proof of the lemma.

80
With this lemma, the proof of Theorem 10.8 reduces to an application of Lemma 6.3 through a similar
argument to the proof of Theorem 10.2.

Proof of Theorem 10.8. Define the steps of A to map inputs to outputs as follows
A
1 2A 3A A
M −−→ (MISGM , MNEG - SPCA ) −−→ (Z, MNEG - SPCA ) −−→ (Z, W ) −−4-5
→ ((X1 , y1 ), (X2 , y2 ), . . . , (XN , yN ))
where Z = (Z1 , Z2 , . . . , ZN ) and W = (W1 , W2 , . . . , Wn ) in Figure 10. First note that the condition on γ
in the theorem statement along with the settings of µ and τ in Figure 10 imply that
    
δ p 1−Q
τ≤ p where δ = min log , log
2 6 log(mn) + 2 log(p − Q)−1 Q 1−p
δ 1
µ≤ p ·p
2 6 log(kn mrt ) + 2 log(p − Q)−1 rt (r − 1)(1 + (r − 1)−1 )
for a sufficiently small constant c > 0 since 0 < q < p ≤ 1 are constants. Let θ and vS be as in Lemma
10.12. Consider Lemma 6.3 applied to the steps Ai above and the following sequence of distributions
P0 = M[m]×[n] (S × T, Bern(p), Bern(q))
P1 = M[m]×[n] (S × T, Bern(p), Bern(Q)) ⊗ M[m]×[n] (S × T, Bern(p), Bern(Q))
P2 = ISGMD (N, S, d, µ, ) ⊗ M[m]×[n] (S × T, Bern(p), Bern(Q))
 ⊗n
P3 = ISGMD (N, S, d, µ, ) ⊗ N 0, Id − θvS vS>
P4-5 = MSLRD (N, S, d, γ, )
Combining the inequalities above for µ and τ with Lemmas 7.4 and 10.12 and Theorems 10.2 and 9.5
implies that we can take
 
1 = 0, 2 = O w−1 + kn−2 m−2 r−2t , 3 = O m3/2 n−1/2 + kn (4e−3 )n/2kn and 4-5 = O(N −1 )


Applying Lemma 6.3 now yields the first total variation upper bound in the theorem. Now consider Lemma
6.3 applied to
P0 = Bern(q)⊗m×n
P1 = Bern(Q)⊗m×n ⊗ Bern(Q)⊗m×n
P2 = N (0, Id )⊗N ⊗ Bern(Q)⊗m×n
P3 = N (0, Id )⊗N ⊗ N (0, Id )⊗n
⊗N
P4-5 = N (0, Id ) ⊗ N (0, 1 + γ 2 )
By Lemmas 7.4 and 10.12 and Theorems 10.2 and 9.5, we can take
 
1 = 0, 2 = O kn−2 m−2 r−2t , 3 = O m3/2 n−1/2 4-5 = O(N −1 )

and

Applying Lemma 6.3 now yields the second total variation upper bound in the theorem and completes the
proof of the theorem.

As in the previous section, the random matrix RL, can be used in place of Kr,t in our reduction k- BPDS -
TO - MSLR . Specifically, replacing k- BPDS - TO - ISGM in Step 2 with k- BPDS - TO - ISGMR and again replacing
t
r with the more flexible parameter L yields an alternative reduction k- BPDS - TO - MSLRR . The guarantees
below for this modified reduction follow from the same argument as in the proof of Theorem 10.8, using
Corollary 10.5 in place of Theorem 10.2.

81
Corollary 10.13 (Reduction from k- BPDS to MSLR with RL, ). Let n be a parameter and let w(n) = ω(1)
be a slow-growing function. Fix initial and target parameters as follows:

• Initial k- BPDS Parameters: m, n, km , kn , p, q and F as in Theorem 10.8.

• Target MSLR Parameters: (N, d, γ, ) and a parameter L = L(N ) ∈ N such that N ≤ n and
(N, d, , L) satisfy the conditions in Corollary 10.5. Suppose that γ ∈ (0, 1/2) satisfies that
 
2 km kn km
γ ≤ c · min ,
L log(kn mL) log N n log(mn)

for a sufficiently small constant c > 0.

If A denotes k- BPDS - TO - MSLRR applied with the parameters above, then A runs in poly(m, n) time and
 
dTV A M[m]×[n] (S × T, p, q) , MSLRD (N, S, d, γ, ) = o(1)
 ⊗N 
dTV A Bern(q)⊗m×n , N (0, Id ) ⊗ N 0, 1 + γ 2

= o(1)

for all km -subsets S ⊆ [m] and kn -subsets T ⊆ [n] with |T ∩ Fi | = 1 for each 1 ≤ i ≤ kn .

11 Completing Tensors from Hypergraphs


In this section we introduce a key subroutine that will be used in our reduction to tensor PCA in Section
15. The starting point for our reduction k- HPDS - TO - TPCA is the hypergraph problem k- HPDS. The adja-
cency tensor of this instance is missing all entries with at least one pair of equal indices. The first procedure
A DVICE -C OMPLETE -T ENSOR in this section gives a method of completing these missing entries and pro-
ducing an instance of the planted sub-tensor problem, given access to a set of s − 1 vertices in the clique,
where s is the order of the target tensor. In order to translate this into a reduction, we iterate over all
(s − 1)-sets of vertices and carry out this reduction for each one, as will be described in more detail later
in this section. For the motivation and high-level ideas behind the reductions in this section, we refer to the
discussion in Section 4.6.
In order to describe our reduction A DVICE -C OMPLETE -T ENSOR, we will need the following definition
which will be crucial in indexing the missing entries of the tensor.

Definition 11.1 (Tuple Statistics). Given a tuple I = (i1 , i2 , . . . , is ) where each ij ∈ U for some set U , we
define the partition P (I) and permutations τP (I) and τV (I) of [s] as follows:

1. Let P (I) be the unique partition of [s] into nonempty parts P1 , P2 , . . . , Pt where ik = il if and only if
k, l ∈ Pj for some 1 ≤ j ≤ t, and let |P (I)| = t.

2. Given the partition P (I), let τP (I) be the permutation of [s] formed by ordering the parts Pj in
increasing order of their largest element, and then listing the elements of the parts Pj according to
this order, where the elements of each individual part are written in decreasing order.

3. Let P10 , P20 , . . . , Pt0 be the ordering of the parts of P (I) as defined above and let v1 , v2 , . . . , vt be such
that vj = ik for all k ∈ Pj0 or in other words vj is the common value of ik of all indices k in the part
Pj0 . The values v1 , v2 , . . . , vt are by definition distinct and their ordering induces a permutation σ on
[t]. Let τV (I) be the permutation on [s] formed by setting (τV (I))[t] = σ and extending σ to [s] by
taking (τV (I)) (j) = j for all t < j ≤ s.

82
Algorithm A DVICE -C OMPLETE -T ENSOR
Inputs: HPDS instance H ∈ Gns with edge probabilities 0 < q < p ≤ 1, an (s − 1)-set of advice vertices
V = {v1 , v2 , . . . , vs−1 } of H

1. Clone Hyperedges: Compute the (s!)2 hypergraphs H σ1,σ2 ∈ Gns for each pair σ1 , σ2 ∈ Ss by
applying B ERNOULLI -C LONE with t = (s!)2 to the Ns hyperedge indicators of H with input
Bernoulli probabilities p and q and output probabilities p and
 
Q = 1 − (1 − p)1−1/t (1 − q)1/t + 1{p=1} q 1/t − 1

2. Form Tensor Entries: For each I = (i1 , i2 , . . . , is ) ∈ ([N ]\V )s , set the (i1 , i2 , . . . , is )th entry of
the tensor T with dimensions (N − s + 1)⊗s to be the following hyperedge indicator
n  o
Ti1 ,i2 ,...,is = 1 {v1 , v2 , . . . , vs−|P (I)| } ∪ {i1 , i2 , . . . , is } ∈ E H τP (I),τV (I)

where P (I), τP (I) and τV (I) are as in Definition 11.1.

3. Output: Output the order s tensor T with axes indexed by the set [N ]\V .

Algorithm I TERATE - AND -R EDUCE


Inputs: k- HPDS instance H ∈ Gns with edge probabilities 0 < q < p ≤ 1, partition E of [n] into k
equally-sized parts, a one-sided blackbox B for the corresponding planted tensor problem

1. For every (s − 1)-set of vertices {v1 , v2 , . . . , vs−1 } all from different parts of E, form the tensor T
by applying A DVICE -C OMPLETE -T ENSOR to H and {v1 , v2 , . . . , vs−1 }, remove the indices of T
that are in the same part of E as at least one of {v1 , v2 , . . . , vs−1 } and run the blackbox B on the
resulting tensor T .

2. Output H1 if any application if B in Step 1 output H1 .

Figure 11: The first reduction is a subroutine to complete the entries of a planted dense sub-hypergraph problem into
a planted tensor problem given an advice set of vertices. The second reduction uses this subroutine to reduce solving
a planted dense sub-hypergraph problem to producing a one-sided blackbox solving the planted tensor problem.

Note that |P (I)| is the number of distinct values in I and thus |P (I)| = |{i1 , i2 , . . . , is }| for each I. For
example, if I = (4, 4, 1, 2, 2, 5, 3, 5, 2) and s = 9, then P (I), τP (I) and τV (I) are
P (I) = {{1, 2}, {3}, {4, 5, 9}, {6, 8}, {7}} , τP (I) = (2, 1, 3, 7, 8, 6, 9, 5, 4) and
τV (I) = (4, 1, 3, 5, 2, 6, 7, 8, 9)
We now establish themain Markov transition properties of A DVICE -C OMPLETE -T ENSOR. Given a set
X, let EX,s be the set Xs of all subsets of X of size s.
Lemma 11.2 (Completing Tensors with Advice Vertices). Let 0 < q < p ≤ 1 be such that min{q, 1 − q} =
ΩN (1) and let s be a constant. Let 0 < Q < p be given by
 
Q = 1 − (1 − p)1−1/t (1 − q)1/t + 1{p=1} q 1/t − 1

83
where t = (s!)2 . Let V be an arbitrary (s−1)-subset of [N ] and let A denote A DVICE -C OMPLETE -T ENSOR
with input H, output T , advice vertices V and parameters p and q. Then A runs in poly(N ) time and satisfies
 
A ME[N ],s (ES∪V,s , Bern(p), Bern(q)) ∼ M([N ]\V )s (S s , Bern(p), Bern(Q))
 
A ME[N ],s (Bern(q)) ∼ M([N ]\V )s (Bern(Q))

for all subsets S ⊆ [N ] disjoint from V .


Proof. First note that Step 2 of A is well defined since the fact that |P (I)| = |{i1 , i2 , . . . , is }| implies
that {v1 , v2 , . . . , vs−|P (I)| } ∪ {i1 , i2 , . . . , is } is always a set of size s. We first consider the case in which
H ∼ ME[N ],s (ES∪V,s , Bern(p), Bern(q)). By Lemma 7.4, it follows that the hyperedge indicators of H σ1 ,σ2
are all independent and distributed as

σ1 ,σ2 Bern(p) if e ⊆ S ∪ V
1 {e ∈ E (H )} ∼
Bern(Q) otherwise
for each σ1 , σ2 ∈ Ss and subset e ⊆ [N ] with |e| = s. We now observe that T agrees in its entrywise
marginal distributions with M([N ]\V )s (S s , Bern(p), Bern(Q)). In particular, we have that:
• if (i1 , i2 , . . . , is ) is such that ij ∈ S for all 1 ≤ j ≤ s then we have that {v1 , v2 , . . . , vs−|P (I)| } ∪
{i1 , i2 , . . . , is } ⊆ S ∪ V and hence
n  o
Ti1 ,i2 ,...,is = 1 {v1 , v2 , . . . , vs−|P (I)| } ∪ {i1 , i2 , . . . , is } ∈ E H τP (I),τV (I) ∼ Bern(p)

• if (i1 , i2 , . . . , is ) is such that there is some j such that ij j 6∈ S, then {v1 , v2 , . . . , vs−|P (I)| } ∪
{i1 , i2 , . . . , is } 6⊆ S ∪ V and Ti1 ,i2 ,...,is ∼ Bern(Q).
It suffices to verify that the entries of T are independent. Since all of the hyperedge indicators of the H σ1 ,σ2
are independent, it suffices to verify that the entries of T are equal to distinct hyperedge indicators.
To show this, we will show that {i1 , i2 , . . . , is }, τP (I) and τV (I) determine the tuple I = (i1 , i2 , . . . , is ),
from which the desired result follows. Consider the longest increasing subsequence of τP (I) starting with
(τP (I)) (1). The elements of this subsequence partition τP (I) into contiguous subsequences corresponding
to the parts of P (I). Thus τP (I) determines P (I). Now the first |P (I)| elements of τV (I) along with
{i1 , i2 , . . . , is } determine the values vj in Definition 11.1 corresponding to I on each part of P (I). This
uniquely determines the tuple I. Therefore the entries Ti1 ,i2 ,...,is all correspond to distinct hyperedge indi-
cators and are therefore independent. Applying this argument with S = ∅ yields the second identity in the
statement of the lemma. This completes the proof of the lemma.
We now analyze the additional subroutine I TERATE - AND -R EDUCE. This will show it suffices to design
a reduction with low total variation error in order to show computational lower bounds for Tensor PCA. Let
k- PSTsE (N, k, p, q) denote the following planted subtensor hypothesis testing problem with hypotheses
H0 : T ∼ M[N ]s (Bern(q)) and H1 : T ∼ M[N ]s (S s , Bern(p), Bern(q))
where S is chosen uniformly at random from all k-subsets of [N ] intersecting each part of E in one element.
The next lemma captures our key guarantee of I TERATE - AND -R EDUCE.
Lemma 11.3 (Hardness of One-Sided Blackboxes by Reduction). Fix a pair 0 < q < p ≤ 1 with min{q, 1−
q} = Ω(1), a constant s and let Q be as in Figure 11. Suppose that √ there is a reduction mapping both
hypotheses of k- PSTsE (N − (s − 1)N/k, k − s + 1, p, Q) with k = o( N ) to the corresponding hypotheses
H0 and H1 of a testing problem P within total variation O(N −s ). Then the k- HPCs or k- HPDSs conjecture
for constant 0 < q < p ≤ 1 implies that there cannot be a poly(n) time algorithm A solving P with a low
false positive probability of PH0 [A(X) = H1 ] = O(N −s ), where X denotes the observed variable in P.

84
Proof. Assume for contradiction that there is a such a poly(n) time algorithm A for P with PH0 [A(X) =
H1 ] = O(N −s ) and Type I+II error

PH0 [A(X) = H1 ] + PH1 [A(X) = H0 ] ≤ 1 − 

for some  = Ω(1). Furthermore, let R denote the reduction described in the lemma. If H00 and H10 denote
the hypotheses of k- PSTsE (N − (s − 1)N/k, k − s + 1, p, Q) and T denotes an instance of this problem,
then R satisfies that
       
dTV R LH00 (T ) , LH0 (T ) + dTV R LH10 (T ) , LH1 (T ) = O(N −s )

Now consider
√ applying I TERATE - AND -R EDUCE to: (1) a hard instance H of k- HPDS(N, k, p, q) with
k = o( N ); and (2) the blackbox B = A ◦ R. Let IR(H) ∈ {H000 , H100 } denote the output of I TERATE -
AND -R EDUCE on input H, and let H000 and H100 be the hypotheses of k- HPDS (N, k, p, q). Furthermore, let
s−1 k 
T1 , T2 , . . . , TK denote the tensors formed in the K = Nk s−1 iterations of Step 1 of I TERATE - AND -
R EDUCE. Note that each Ti has all of its s dimensions equal to N − (s − 1)N/k since exactly s − 1 parts
of E of size N/k are removed from [N ] in each iteration of Step 1 of I TERATE - AND -R EDUCE. First con-
sider the case in which H000 holds. Each tensor in the sequence T1 , T2 , . . . , TK is marginally distributed as
M[N −(s−1)N/k]s (Bern(Q)) by Lemma 11.2. By definition IR(H) = H100 if and only if some application of
B(Ti ) outputs H1 . Now note that by a union bound, the definition of dTV and the data-processing inequality,
we have that
K
X
H100
 
PH000 IR(H) = ≤ PH000 [A ◦ R(Ti ) = H1 ]
i=1
XK h    i
≤ PH0 [A(X) = H1 ] + dTV R LH00 (T ) , LH0 (T )
i=1
= O K · N −s = O(N −1 )


since K = O(N s−1 ). Now suppose that H100 holds and let i∗ be the first iteration of I TERATE - AND -R EDUCE
in which each of the vertices {v1 , v2 , . . . , vs−1 } are in the planted dense sub-hypergraph of H. Lemma 11.2
shows that Ti∗ is distributed as M[N −(s−1)N/k]s (S s , Bern(p), Bern(Q)) where S is chosen uniformly at
random over all (k − s + 1)-subsets of [N − (s − 1)N/k] with one element per part of the input partition E
associated with H. We now have that

PH100 IR(H) = H000 ≤ 1 − PH100 IR(H) = H100 ≤ 1 − PH100 [A ◦ R(Ti∗ ) = H1 ]


   
   
≤ 1 − PH1 [A(X) = H1 ] + dTV R LH10 (T ) , LH1 (T )
= PH1 [A(X) = H0 ] + O(N −s )

Therefore the Type I+II error of I TERATE - AND -R EDUCE is

PH000 IR(H) = H100 + PH100 IR(H) = H000 = PH1 [A(X) = H0 ] + O(N −1 ) ≤ 1 −  + O(N −1 )
   

and I TERATE - AND -R EDUCE solves k- HPDS, contradicting the k- HPDS conjecture.

85
Part III
Computational Lower Bounds from PCρ
12 Secret Leakage and Hardness Assumptions
In this section, we further discuss the conditions in the PCρ conjecture and provide evidence for it and for
the specific hardness assumptions we use in our reductions. In Section 12.1, we show that k- HPCs is our
strongest hardness assumption, explicitly give the ρ corresponding to each of these hardness assumptions
and show that the barriers in Conjecture 2.3 are supported by the PCρ conjecture for these ρ. In Section 12.2,
we give more general evidence for the PCρ conjecture through the failure of low-degree polynomial tests.
We also discuss technical conditions in variants of the low-degree conjecture and how these relate to the PCρ
conjecture. Finally, in Section 12.3, we give evidence supporting several of the barriers in Conjecture 2.3
from statistical query lower bounds.
We remark that, as mentioned at the end of Section 2, all of our results and conjectures for PCρ appear
to also hold for PDSρ at constant edge densities 0 < q < p ≤ 1. Evidence for these extensions to PDSρ from
the failure of low-degree polynomials and SQ algorithms can be obtained through computations analogous
to those in Sections 12.2 and 12.3.

12.1 Hardness Assumptions and the PCρ Conjecture


In this section, we continue the discussion of the PCρ conjecture from Section 2. We first show that k- HPCs
reduces to the other conjectured barriers in Conjecture 2.3. We then formalize the discussion in Section 2
and explicitly construct secret leakage distributions ρ such that the graph problems in Conjecture 2.3 can be
obtained from instances of PCρ with these ρ. We then verify that the PCρ conjecture implies Conjecture 2.3
up to arbitrarily small polynomial factors. More precisely, we verify that these ρ, when constrained to be in
the conjecturally hard parameter regimes in Conjecture 2.3, satisfy the tail bound conditions on pρ (s) in the
PC ρ conjecture.

The k- HPCs Conjecture is the Strongest Hardness Assumption. First note that when s = 2, our conjec-
tured hardness for k- HPCs is exactly our conjectured hardness for k- PC in Conjecture 2.3. Thus it suffices
to show that Conjecture 2.3 for k- HPCs implies the conjecture for k- BPC and BPC. This is the content of the
following lemma.
Lemma 12.1. Let α be a fixed positive rational number and w = w(n) be an arbitrarily slow-growing func-
tion with w(n) → ∞. Then there is a positive integer s and a poly(n) time reduction from k- HPCs (n, k, 1/2)

with k = o( n) to either k- BPC(M,
√ N, kM , kN , 1/2)
√ or BPC(M, N,√ kM , kN , 1/2) for
√ some parameters sat-
α
isfying M = Θ(N ) and Cw −1 N ≤ kN = o( N ) and Cw −1 M ≤ kM = o( M ) for some positive
constant C > 0.
Proof. We first describe the desired reduction to k- BPC. Let α = a/b for two fixed integers a and b, and let
a+b
H be an input instance of k- HPCE (n, k, 1/2) where E is a fixed known partition of [n]. Suppose that H
−1/ max(a,b) √ √
is a nearly tight instance with w n ≤ k = o( n). Now consider the following reduction:
1. Let R1 , RS
2 , . . . , Ra+b be a partition of [k] into a + b sets of sizes differing by at most 1, and let
E(Rj ) = i∈Rj Ei for each j ∈ [a + b].

2. Form the bipartite graph G with left vertex set indexed by V1 = E(R1 ) × E(R2 ) × · · · × E(Ra ) and
right vertex set V2 = E(Ra+1 ) × E(Ra+2 ) × · · · × E(Ra+b ) such that (u1 , u2 , . . . , ua ) ∈ V1 and
(v1 , v2 , . . . , vb ) ∈ V2 are adjacent if and only if {u1 , . . . , ua , v1 , . . . , vb } is a hyperedge of H.

3. Output G with left parts Ei1 × Ei2 × · · · × Eia for all (i1 , i2 , . . . , ia ) ∈ R1 × R2 × · · · × Ra and
right parts Ei1 × Ei2 × · · · × Eib for all (i1 , i2 , . . . , ib ) ∈ Ra+1 × Ra+2 × · · · × Ra+b , after randomly
permuting the vertex labels of G within each of these parts.
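The following Python sketch illustrates Steps 1-3 on a concrete representation; the encoding of H as a set of frozensets of hyperedges and of E as a list of parts is an illustrative assumption, and the label permutation of Step 3 is omitted for brevity.

import itertools

def hpc_to_bpc(n, k, a, b, E, hyperedges):
    # Step 1: partition [k] into a + b index sets of sizes differing by at most one
    R = [list(range(j, k, a + b)) for j in range(a + b)]
    ER = [sorted(v for i in Rj for v in E[i]) for Rj in R]
    # Step 2: left vertices are a-tuples over E(R_1) x ... x E(R_a),
    # right vertices are b-tuples over E(R_{a+1}) x ... x E(R_{a+b})
    left = list(itertools.product(*ER[:a]))
    right = list(itertools.product(*ER[a:]))
    # an edge is present iff the union of the two tuples is a hyperedge of H
    edges = {(u, v) for u in left for v in right
             if frozenset(u + v) in hyperedges}
    return left, right, edges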

Note that since a + b = Θ(1), we have that |E(R_i)| = Θ(n) for each i and thus N = |V_2| = Θ(n^b) and M = |V_1| = Θ(n^a) = Θ(N^α). Under H_0, each possible hyperedge of H is included independently with probability 1/2. Since each edge indicator of G corresponds to a distinct hyperedge indicator of H in Step 2 above, it follows that each edge of G is also included with probability 1/2 and thus G ∼ G_B(M, N, 1/2). In the case of H_1, suppose that H is distributed according to the hypergraph planted clique distribution with clique vertices S ⊆ [n] where S ∼ U_n(E). Examining the definition of the edge indicators in Step 2 above yields that G is a sample from H_1 of k-BPC(M, N, k_M, k_N, 1/2) conditioned on having left biclique set ∏_{i=1}^{a}(S ∩ E(R_i)) and right biclique set ∏_{i=a+1}^{a+b}(S ∩ E(R_i)). Observe that these sets have exactly one vertex in common with each of the parts of G described in Step 3 above. Now note that since S has one vertex per part of E, we have that |S ∩ E(R_i)| = |R_i| = Θ(k) since a + b = Θ(1). Thus k_M = |∏_{i=1}^{a}(S ∩ E(R_i))| = Θ(k^a) and k_N = Θ(k^b). The bound on k now implies that the two desired bounds on k_N and k_M hold for a sufficiently small constant C > 0. Thus the permutations in Step 3 produce a sample exactly from k-BPC(M, N, k_M, k_N, 1/2) in the desired parameter regime. If instead of only permuting vertex labels within each part, we randomly permute all left vertex labels and all right vertex labels in Step 3, the resulting reduction produces BPC instead of k-BPC. The correctness of this reduction follows from the same argument as for k-BPC.

We remark that since m and n are polynomial in each other in the setup of Conjecture 2.3 for k-BPC and BPC, the lemma above fills out a dense subset of this entire parameter regime – namely where m = Θ(n^α) for some rational α. In the case where α is irrational, the reduction in Lemma 12.1, when composed with our other reductions beginning with k-BPC and BPC, shows tight computational lower bounds up to arbitrarily small polynomial factors n^ε by approximating α arbitrarily closely with a rational number.

Hardness Conjectures as Instances of PCρ . We now will verify that each of the graph problems in Con-
jecture 2.3 can be obtained from PCρ . To do this, we explicitly construct several ρ and give simple reductions
from the corresponding instances of PCρ to these graph problems. We begin with k- PC, BPC and k- BPC as
their discussion will be brief.

Secrets for k- PC, BPC and k- BPC. Below are the ρ corresponding to these three graph problems. Both BPC
and k- BPC can be obtained by restricting to bipartite subgraphs of the PCρ instances with these ρ.

• k-partite PC: Suppose that k divides n and E is a partition of [n] into k parts of size n/k. By
definition, k- PCE (n, k, 1/2) is PCρ (n, k, 1/2) where ρ = ρk- PC (E, n, k) is the uniform distribution
Un (E) over all k-sets of [n] intersecting each part of E in one element.

• bipartite PC: Let ρBPC (m, n, km , kn ) be the uniform distribution over all (kn + km )-sets of [n + m]
with kn elements in {1, 2, . . . , n} and km elements in {n + 1, n + 2, . . . , n + m}. An instance of
BPC (m, n, km , kn , 1/2) can then be obtained by outputting the bipartite subgraph of PCρ (m+n, km +
kn , 1/2) with this ρ, consisting of the edges between left vertex set {n + 1, n + 2, . . . , n + m} and
right vertex set {1, 2, . . . , n}.

• k-part bipartite PC: Suppose that kn divides n, km divides m, and E and F are partitions of [n] and
[m] into kn and km parts of equal size, respectively. Let ρk- BPC (E, F, m, n, km , kn ) be uniform over
all (kn +km )-subsets of [n+m] with exactly one vertex in each part of both E and n+F . Here, n+F
denotes the partition of {n + 1, n + 2, . . . , n + m} induced by shifting indices in F by n. As with BPC,

k- BPC(m, n, km , kn , 1/2) can be realized as the bipartite subgraph of PCρ (m+n, km +kn , 1/2), with
this ρ, between the vertex sets {n + 1, n + 2, . . . , n + m} and {1, 2, . . . , n}.
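To make these three secret distributions concrete, the following sketch samples from each of them; the representation of the partitions E and F as lists of parts, and the 1-indexed vertex labels, are illustrative assumptions.

import random

def rho_kpc(E):
    # rho_{k-PC}(E, n, k): one uniformly random vertex from each part of E
    return {random.choice(part) for part in E}

def rho_bpc(m, n, k_m, k_n):
    # rho_BPC: k_n uniform elements of {1, ..., n} and k_m of {n + 1, ..., n + m}
    return set(random.sample(range(1, n + 1), k_n)) | \
           set(random.sample(range(n + 1, n + m + 1), k_m))

def rho_kbpc(E, F, n):
    # rho_{k-BPC}: one vertex per part of E and one per part of n + F
    return {random.choice(part) for part in E} | \
           {n + random.choice(part) for part in F}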

Secret for k- HPCs . We first will give the secret ρ corresponding to k- HPCs for even s, which can be viewed
as roughly the pushforward of Un (E) after unfolding the adjacency tensor of k- HPCs . The secret for odd s
will then be obtained through a slight modification of the even case.
Suppose that s = 2t. Given a set S ⊆ [n], let P_t^n(S) denote the subset of [n^t] given by

P_t^n(S) = { 1 + ∑_{j=0}^{t−1} (a_j − 1)n^j : a_0, a_1, . . . , a_{t−1} ∈ S }

In other words, P_t^n(S) is the set of all numbers x in [n^t] such that the base-n representation of x − 1 only has digits in S − 1, where S − 1 is the set of all s − 1 with s ∈ S. Note that if |S| = k then |P_t^n(S)| = k^t. Given a partition E of [n] into k parts of size n/k, let ρ_{k-HPC_s}(E, n, k) be the distribution over k^t-subsets of [n^t] sampled by choosing S at random from U_n(E) and outputting P_t^n(S). Throughout the rest of this section, we will let I(a_0, a_1, . . . , a_{t−1}) denote the sum 1 + ∑_{j=0}^{t−1} (a_j − 1)n^j. We now will show that k-HPC^s_E(n, k, 1/2) can be obtained from PC_ρ(n^t, k^t, 1/2) where ρ = ρ_{k-HPC_s}(E, n, k). Intuitively, this instance of PC_ρ has a subset of edges corresponding to the unfolded adjacency tensor of k-HPC^s_E. More formally, consider the following steps.

1. Let G be an input instance of PCρ (nt , k t , 1/2) and let H be the output hypergraph with vertex set [n].

2. Construct H as follows: for each possible hyperedge e = {a1 , a2 , . . . , a2t }, with 1 ≤ a1 < a2 <
· · · < a2t ≤ n, include e in H if and only if there is an edge between vertices I(a1 , a2 , . . . , at ) and
I(at+1 , at+2 , . . . , a2t ) in G.

Under H_0, it follows that G ∼ G(n^t, 1/2). Note that each hyperedge e in Step 2 identifies a unique pair of distinct vertices I(a_1, a_2, . . . , a_t) and I(a_{t+1}, a_{t+2}, . . . , a_{2t}) in G, and thus the hyperedges of H are independently included with probability 1/2. Under H_1, the instance of PC_ρ(n^t, k^t, 1/2) is sampled from the planted clique distribution with clique vertices P_t^n(S) where S ∼ U_n(E). By the definition of P_t^n(S), it follows that I(a_1, a_2, . . . , a_t) is in this clique if and only if a_1, a_2, . . . , a_t ∈ S. Examining the edge indicators of H then yields that H is a sample from the hypergraph planted clique distribution with clique vertex set S. Since S ∼ U_n(E), it follows that H is a sample from the corresponding hypothesis of k-HPC_s under both H_0 and H_1.
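A minimal sketch of this even-case unfolding, assuming the input graph is given by a set of frozenset edges on [n^t] and digits are 1-indexed as in the definition of I above:

import itertools

def I(digits, n):
    # I(a_1, ..., a_t) = 1 + sum_j (a_j - 1) n^j with the first digit at j = 0
    return 1 + sum((a - 1) * n ** j for j, a in enumerate(digits))

def unfold_to_hypergraph(G_edges, n, t):
    # Step 2: include {a_1 < ... < a_2t} iff {I(a_1..a_t), I(a_{t+1}..a_2t)} is an edge of G
    H = set()
    for e in itertools.combinations(range(1, n + 1), 2 * t):
        if frozenset((I(e[:t], n), I(e[t:], n))) in G_edges:
            H.add(e)
    return H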
Now suppose that s is odd with s = 2t + 1. The idea in this case is to pair up adjacent digits in base-N expansions and use these pairs to label the vertices of k-HPC_s. More precisely, suppose that n = N² and k = K² for some positive integers K and N. Let E be a fixed partition of [n] into k = K² equally sized parts and let ρ_{k-HPC_s}(E, n, k) be ρ_{k-HPC_{2s}}(F, N, K) as defined above for the even number 2s, where F is a fixed partition of [N] into K equally sized parts. We now will show that k-HPC^s_E(n, k, 1/2) can be obtained from PC_ρ(N^s, K^s, 1/2) where ρ = ρ_{k-HPC_{2s}}(F, N, K). Let I' be the analogue of I for base-N expansions, i.e. let I'(b_0, b_1, . . . , b_{t−1}) denote the sum 1 + ∑_{j=0}^{t−1} (b_j − 1)N^j. Consider the following steps.
i.e. let I(b0 , b1 , . . . , bt−1 ) denote the sum 1 + j=0 (bj − 1)N j . Consider the following steps.

1. Let G be an instance of PCρ (N s , K s , 1/2) and let H be the output hypergraph with vertex set [n].

2. Let σ : [n] → [n] be a bijection such that, for each i ∈ [k], we have that

σ(E_i) = { I'(b_0, b_1) : b_0 ∈ F_{c_0} and b_1 ∈ F_{c_1} }

where c_0, c_1 are the unique elements of [K] with i − 1 = (c_0 − 1) + (c_1 − 1)K.

3. Construct H as follows. For each possible hyperedge e = {a_1, a_2, . . . , a_s}, with 1 ≤ a_1 < a_2 < · · · < a_s ≤ n, let b_{2i−1}, b_{2i} be the unique elements of [N] with I'(b_{2i−1}, b_{2i}) = σ(a_i) for each i. Now include e in H if and only if there is an edge between the two vertices I'(b_1, b_2, . . . , b_s) and I'(b_{s+1}, b_{s+2}, . . . , b_{2s}) in G.

4. Permute the vertex labels of H within each part E_i uniformly at random.

Note that σ always trivially exists because the K² sets E_1, E_2, . . . , E_{K²} and the K² sets F'_{i,j} = {I'(b_0, b_1) : b_0 ∈ F_i and b_1 ∈ F_j} for 1 ≤ i, j ≤ K are both partitions of [n] into parts of size N²/K². As in the case where s is even, under H_0 we have that G ∼ G(N^s, 1/2) and the hyperedges of H are independently included with probability 1/2, since Step 3 identifies distinct pairs of vertices for each hyperedge e. Under H_1, let S ∼ U_N(F) be such that the clique vertices in G are P_s^N(S). By the same reasoning as in the even case, after Step 3, the hypergraph H is distributed as a sample from the hypergraph planted clique distribution with clique vertex set σ^{−1}(I'(S, S)) where I'(S, S) = {I'(s_0, s_1) : s_0, s_1 ∈ S}. The definition of σ now ensures that this clique has one vertex per part of E. Step 4 ensures that the resulting hypergraph is exactly a sample from H_1 of k-HPC_s. We remark that the conditions n = N² and k = K² do not affect our lower bounds when composing the reduction above with our other reductions. This is due to the subsequence criterion for computational lower bounds in Condition 6.1.

Verifying the Conditions of the PCρ Conjecture. We now verify that the PCρ conjecture corresponds to
the hard regimes in Conjecture 2.3 up to arbitrarily small polynomial factors. To do this, it suffices to verify
the tail bound on pρ (s) in the PCρ conjecture for each ρ described above, which is done in the theorem below.
In the next section, we will show that a slightly stronger variant of the PCρ conjecture implies Conjecture
2.3 exactly, without the small polynomial factors.

Theorem 12.2 (PC_ρ Conjecture and Conjecture 2.3). Suppose that m and n are polynomial in one another and let ε > 0 be an arbitrarily small constant. Let ρ be any one of the following distributions:

1. ρ_{k-PC}(E, n, k) where k = O(n^{1/2−ε});

2. ρ_{BPC}(m, n, k_m, k_n) where k_n = O(n^{1/2−ε}) and k_m = O(m^{1/2−ε});

3. ρ_{k-BPC}(E, F, m, n, k_m, k_n) where k_n = O(n^{1/2−ε}) and k_m = O(m^{1/2−ε}); and

4. ρ_{k-HPC_t}(E, n, k, 1/2) for t ≥ 3 where k = O(n^{1/2−ε}).

Then there is a constant δ > 0 such that: for any parameter d = O_n((log n)^{1+δ}), there is some p_0 = o_n(1) such that p_ρ(s) satisfies the tail bounds

p_ρ(s) ≤ p_0 · 2^{−s²} if 1 ≤ s² < d,   and   p_ρ(s) ≤ p_0 · s^{−2d−4} if s² ≥ d

Proof. We first prove the desired tail bounds hold for (1). Let C > 0 be a constant such that k ≤ Cn^{1/2−ε}. Note that the probability that S and S', independently sampled from ρ = ρ_{k-PC}(E, n, k), intersect in their elements in E_i is 1/|E_i| = k/n for each 1 ≤ i ≤ k. Furthermore, these events are independent. Thus it follows that if ρ = ρ_{k-PC}(E, n, k), then p_ρ is the PMF of Bin(k, k/n). In particular, we have that

p_ρ(s) = (k choose s) (k/n)^s (1 − k/n)^{k−s} ≤ k^s · (k/n)^s = (k²/n)^s ≤ C^{2s} · n^{−2εs}

Let p_0 = p_0(n) be a function tending to zero arbitrarily slowly. The bound above implies that p_ρ(s) ≤ p_0 · 2^{−s²} as long as s ≤ C_1 log n for some sufficiently small constant C_1 > 0. Furthermore, a direct computation verifies that p_ρ(s) ≤ p_0 · s^{−2d−4} as long as

s ≥ C_2 d log d / log n

for some sufficiently large constant C_2 > 0. Thus if d = O_n((log n)^{1+δ}) for some δ ∈ (0, 1), then C_2 d log d / log n < √d and C_1 log n > √d for sufficiently large n. This implies the desired tail bound for (1).
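The fact that the overlap of two independent secrets from ρ_{k-PC} is exactly Bin(k, k/n) is easy to confirm numerically; the following sketch compares the empirical overlap distribution to the binomial PMF (the parameter values are illustrative).

import random
from math import comb
from collections import Counter

def sample_secret(E):
    # S ~ U_n(E): one uniformly random vertex from each part of E
    return {random.choice(part) for part in E}

def overlap_check(n=100, k=10, trials=200000):
    E = [list(range(i * (n // k), (i + 1) * (n // k))) for i in range(k)]
    counts = Counter(len(sample_secret(E) & sample_secret(E)) for _ in range(trials))
    for s in range(k + 1):
        binom = comb(k, s) * (k / n) ** s * (1 - k / n) ** (k - s)
        print(s, counts[s] / trials, round(binom, 6))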
The other three cases are similar. In the case of (3), if S and S' are independently sampled from ρ = ρ_{k-BPC}(E, F, m, n, k_m, k_n), then the probability that S and S' intersect in their elements in E_i is k_n/n for each 1 ≤ i ≤ k_n, and the probability that they intersect in their elements in n + F_i is k_m/m for each 1 ≤ i ≤ k_m. Thus p_ρ is the PMF of the sum of independent samples from Bin(k_m, k_m/m) and Bin(k_n, k_n/n). It follows that

p_ρ(s) = ∑_{ℓ=0}^{s} (k_n choose ℓ) (k_n/n)^ℓ (1 − k_n/n)^{k_n−ℓ} · (k_m choose s−ℓ) (k_m/m)^{s−ℓ} (1 − k_m/m)^{k_m−s+ℓ}
       ≤ ∑_{ℓ=0}^{s} (k_n²/n)^ℓ (k_m²/m)^{s−ℓ} ≤ s · max{ (k_n²/n)^s, (k_m²/m)^s }     (9)

Repeating the bounding argument as in (1) shows that the desired tail bound holds for (3) if d = O_n((log n)^{1+δ}) for some δ ∈ (0, 1). Since m and n are polynomial in one another, it follows that log m = Θ(log n), and thus the (k_m²/m)^s term and the additional factor of s do not affect this bounding argument other than by changing the constants C_1 and C_2. In the case of (2), similar reasoning as in (3) yields that p_ρ, where ρ = ρ_{BPC}(m, n, k_m, k_n), is the PMF of the sum of independent samples from Hyp(n, k_n, k_n) and Hyp(m, k_m, k_m). Now note that

P[Hyp(n, k_n, k_n) = ℓ] = (k_n choose ℓ)(n − k_n choose k_n − ℓ) / (n choose k_n) ≤ (k_n choose ℓ)(n − ℓ choose k_n − ℓ) / (n choose k_n) = (k_n choose ℓ) ∏_{i=0}^{ℓ−1} (k_n − i)/(n − i) ≤ (k_n²/n)^ℓ

This implies that the same upper bound on p_ρ(s) as in Equation (9) also holds for ρ in the case of (2). The argument above for (3) now establishes the desired tail bounds for (2).
We first handle the case in (4) where t is even with t = 2r. We have that ρ = ρ_{k-HPC_t}(E, n, k, 1/2) can be sampled as P_r^n(S) ⊆ [n^r] where S ∼ U_n(E). Thus p_ρ(s) is the PMF of |P_r^n(S) ∩ P_r^n(S')| where S, S' ∼_{i.i.d.} U_n(E). Furthermore, the definition of P_r^n implies that |P_r^n(S) ∩ P_r^n(S')| = |S ∩ S'|^r and, from case (1), we have that |S ∩ S'| ∼ Bin(k, k/n). It now follows that

p_ρ(s) = (k choose s^{1/r}) (k/n)^{s^{1/r}} (1 − k/n)^{k − s^{1/r}} if s is an rth power, and p_ρ(s) = 0 otherwise.

The same bounds as in case (1) therefore imply that p_ρ(s) ≤ (k²/n)^{s^{1/r}} for all s ≥ 0. A similar analysis as in (1) now shows that p_ρ(s) ≤ p_0 · 2^{−s²} holds if s ≤ C_1 (log n)^{r/(2r−1)} for some sufficiently small constant C_1 > 0, and that p_ρ(s) ≤ p_0 · s^{−2d−4} holds if

s ≥ C_2 (d log d / log n)^r

for some sufficiently large constant C_2 > 0. As long as d = O_n((log n)^{1+δ}) for some 0 < δ < 1/(2r − 1), we have that C_2 (d log d / log n)^r < √d and C_1 (log n)^{r/(2r−1)} > √d for sufficiently large n.

Since t and r are constants here, δ can be taken to be a constant as well. In the case where t is odd, it follows that ρ_{k-HPC_t}(E, n, k) is the same as ρ_{k-HPC_{2t}}(F, √n, √k) for some partition F, as long as n and k are perfect squares. The same argument establishes the desired tail bound for this prior, completing the case of (4) and the proof of the theorem.

12.2 Low-Degree Polynomials and the PCρ Conjecture


In this section, we show that the low-degree conjecture – that low-degree polynomials are optimal for a
class of average-case hypothesis testing problems – implies the PCρ conjecture. In particular, we will obtain
a simple expression capturing the power of the optimal low-degree polynomial for PCρ in Proposition 12.7.
We then will apply this proposition to prove Theorem 12.8, showing that the power of this optimal low-
degree polynomial tends to zero under the tail bounds on pρ in the PCρ conjecture. We also will discuss a
stronger version of the PCρ conjecture that exactly implies Conjecture 2.3. First, we informally introduce
the low-degree conjecture and the technical conditions arising in its various formalizations in the literature.

Polynomial Tests and the Low-Degree Conjecture. In this section, we will draw heavily from similar discussions in [HS17] and Hopkins's thesis [Hop18]. Throughout, we will consider discrete hypothesis testing problems with observations taken without loss of generality to lie in the discrete hypercube {−1, 1}^N. For example, an n-vertex instance of planted clique can be represented in the discrete hypercube by the above-diagonal entries of its signed adjacency matrix when N = (n choose 2). Given a hypothesis H_0, the term D-simple statistic refers to polynomials f : {−1, 1}^N → R of degree at most D in the coordinates of {−1, 1}^N that are calibrated and normalized so that E_{H_0} f(X) = 0 and E_{H_0} f(X)² = 1.
For a broad range of hypothesis testing problems, it has been observed in the literature that D-simple
statistics seem to capture the full power of the SOS hierarchy [HS17, Hop18]. This trend prompted a fur-
ther conjecture that D-simple statistics often capture the full power of efficient algorithms, leading more
concretely to the low-degree conjecture which is stated informally below. This conjecture has been used
to gather evidence of hardness for a number of natural detection problems and has generally emerged as a
convenient tool to predict statistical-computational gaps [HS17, Hop18, KWB19, BKW19]. Variants of this
low-degree conjecture have appeared as Hypothesis 2.1.5 and Conjecture 2.2.4 in [Hop18] and Conjectures
1.16 and 4.6 in [KWB19].

Conjecture 12.3 (Informal – Hypothesis 2.1.5 in [Hop18]). For a broad class of hypothesis testing problems
H0 versus H1 , there is a test running in time N Õ(D) with Type I+II error tending to zero if and only if there
is a successful D-simple statistic i.e. a polynomial f of degree at most D such that EH0 f (X) = 0 and
EH0 f (X)2 = 1 yet EH1 f (X) → ∞.

Detailed discussions of the low-degree conjecture and the connections between D-simple statistics and
other types of algorithms can be found in [KWB19] and [HW20]. The informality in the conjecture above
is the undefined “broad class” of hypothesis testing problems. In [Hop18], several candidate technical
conditions defining this class were proposed and subsequently have been further refined in [KWB19] and
[HW20]. These conditions are discussed in more detail later in this section.
The utility of the low-degree conjecture in predicting statistical-computational gaps arises from the fact
that the optimal D-simple statistic can be explicitly characterized. By the Neyman-Pearson lemma, the
optimal test with respect to Type I+II error is the likelihood ratio test, which declares H1 if LR(X) =
PH1 (X)/PH0 (X) > 1 and H0 otherwise, given a sample X. Computing the likelihood ratio is typically
intractable in problems in high-dimensional statistical inference. The low-degree likelihood ratio LR≤D
is the orthogonal projection of the likelihood ratio onto the subspace of polynomials of degree at most
D. When H0 is a product distribution on the discrete hypercube {−1, 1}N , the following theorem asserts

that LR^{≤D} is the optimal test of a given degree. Here, the projection is with respect to the inner product ⟨f, g⟩ = E_{H_0} f(X)g(X), which also defines a norm ||f||_2² = ⟨f, f⟩.

Theorem 12.4 (Page 35 of [Hop18]). The optimal D-simple statistic is the low-degree likelihood ratio, i.e. it holds that

max_{f ∈ R[x]_{≤D}, E_{H_0} f(X) = 0}  E_{H_1} f(X) / √(E_{H_0} f(X)²)  =  ||LR^{≤D} − 1||_2

Thus existence of low-degree tests for a given problem boils down to computing the norm of the low-degree likelihood ratio. When H_0 is the uniform distribution on {−1, 1}^N, the norm above can be re-expressed in terms of the standard Boolean Fourier basis. Let the collection of functions {χ_α(X) = ∏_{e∈α} X_e : α ⊆ [N]} denote this basis, which is orthonormal over the space {−1, 1}^N with the inner product defined above. By orthonormality, any χ_α with 1 ≤ |α| ≤ D satisfies that

⟨χ_α, LR^{≤D} − 1⟩ = ⟨χ_α, LR⟩ = E_{H_0} χ_α(X) LR(X) = E_{H_1} χ_α(X)

and E_{H_0} LR^{≤D} = E_{H_1} 1 = 1 so that ⟨1, LR^{≤D} − 1⟩ = 0. It then follows by Parseval's identity that

||LR^{≤D} − 1||_2 = ( ∑_{1 ≤ |α| ≤ D} (E_{H_1} χ_α(X))² )^{1/2}     (10)

which is exactly the Fourier energy up to degree D.

Technical Conditions, Sn -Invariance and Counterexamples. While Conjecture 12.3 is believed to accu-
rately predict the computational barriers in nearly any natural high-dimensional statistical problem including
all of the problems we consider, a precise set of criteria exactly characterizing this “broad class” has yet to
be pinned down in the literature. The following was the first formalization of the low-degree conjecture,
which appeared as Conjecture 2.2.4 in [Hop18].
Conjecture 12.5 (Conjecture 2.2.4 in [Hop18]). Let Ω be a finite set or R, and let k be a fixed integer. Let N = (n choose k), let ν be a product distribution on Ω^N and let µ be another distribution on Ω^N. Suppose that µ is S_n-invariant and (log n)^{1+Ω(1)}-wise almost independent with respect to ν. Then no polynomial time test distinguishes T_δ µ and ν with probability 1 − o(1), for any δ > 0. Formally, for all δ > 0 and every polynomial-time test t : Ω^N → {0, 1} there exists δ' > 0 such that for every large enough n,

(1/2) P_{x∼ν}[t(x) = 0] + (1/2) P_{x∼T_δ µ}[t(x) = 1] ≤ 1 − δ'
This conjecture has several key technical stipulations attempting to conservatively pin down the Õ in
Conjecture 12.3 and a set of sufficient conditions to be in this “broad class”. We highlight and explain these
key conditions below.
1. The distribution µ is required to be Sn -invariant. Here, a distribution µ on ΩN is said to be Sn -
invariant if Pµ (x) = Pµ (π · x) for all π ∈ Sn and x ∈ ΩN , where π acts on x by identifying the
coordinates of x with the k-subsets of [n] and permuting these coordinates according to the permuta-
tion on k-subsets induced by π.

2. The (log n)1+Ω(1) -wise almost independence requirement on µ essentially enforces that polynomials
of degree at most (log n)1+Ω(1) are unable to distinguish between µ and ν. More formally, a distri-
bution µ is D-wise almost independent with respect to ν if every D-simple statistic f , calibrated and
normalized with respect to ν, satisfies that Ex∼µ f (x) = O(1).

3. Rather than µ, the distribution the conjecture asserts is hard to distinguish from ν is the result Tδ µ of
applying the noise operator Tδ . Here, the distribution Tδ µ is defined by first sampling x ∼ µ, then
sampling y ∼ ν and replacing each xi with yi independently with probability δ.
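As a minimal sketch of this noise operator, the following applies T_δ to a single sample, assuming ν is given coordinate-wise through a hypothetical sampler sample_nu_coord:

import random

def T_delta(x, sample_nu_coord, delta):
    # resample each coordinate of x ~ mu independently from nu with probability delta
    return [sample_nu_coord(i) if random.random() < delta else xi
            for i, xi in enumerate(x)]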
These technical conditions are intended to conservatively rule out specific pathological examples. As men-
tioned in [Hop18], the purpose of Tδ is to destroy algebraic structure that may lead to efficient algorithms
that cannot be implemented with low-degree polynomials. For example, if µ is uniform over the solution set
to a satisfiable system of equations mod 2 and ν is the uniform distribution, it is possible to distinguish
these two distributions through Gaussian elimination while the lowest D for which a D-simple statistic does
so can be as large as D = Ω(N ). The noise operator Tδ rules out distributions with this kind of algebraic
structure. The (log n)1+Ω(1) -wise requirement on the almost independence of µ and the Õ(D) in Conjecture
12.3 are both to account for the fact that some common polynomial time algorithms for natural hypothesis
testing problems can only be implemented as degree O(log n) polynomials. For example, Section 4.2.3 of
[KWB19] shows that spectral methods can typically be implemented as degree O(log n) polynomials.
In [Hop18], it was mentioned that the Sn -invariance condition was included in Conjecture 12.5 mainly
because most canonical inference problems satisfy this property and, furthermore, that there were no existing
counterexamples to the conjecture without it. Recently, [HW20] gave two constructions of hypothesis testing
problems based on efficiently-correctable binary codes and Reed-Solomon codes. The first construction is
for binary Ω and admits a polynomial-time test despite being Ω(n)-wise almost independent. This shows that
Tδ is insufficient to always rule out high-degree algebraic structure that can be used in efficient algorithms.
However, this construction is also highly asymmetric and ruled out by the S_n-invariance condition in Conjecture
12.5. The second construction is for Ω = R and admits a polynomial-time test despite being both Ω(n)-wise
almost independent and Sn -invariant, thus falsifying Conjecture 12.5 as stated. However, as discussed in
[HW20], the conjecture can easily be remedied by replacing Tδ with another operator, such as the Ornstein-
Uhlenbeck noise operator. In this work, only the case of binary Ω will be relevant to the PCρ conjecture.

The PCρ Conjecture, Technical Conditions and a Generalization. The PCρ hypothesis testing problems
and their planted dense subgraph generalizations PDSρ that we consider in this work can be shown to satisfy
a wide range of properties sufficient to rule out known counterexamples to the low-degree conjecture. In
particular, these problems almost satisfy all three conservative conditions proposed in [Hop18], instead
satisfying a milder requirement for sufficient symmetry than full Sn -invariance.
1. By definition, a general instance of PCρ with an arbitrary ρ is only invariant to permutations π ∈ Sn
that ρ is also invariant to. However, each of the specific hardness assumptions we use in our reductions
corresponds to a ρ with a large amount of symmetry and that is invariant to large subgroups of Sn .
For example, k-PC and k-PDS are invariant to permutations within each part E_i, each of which has size n/k = ω(√n). This symmetry seems sufficient to break the error-correcting code approach used
to construct counterexamples to the low-degree conjecture in [HW20].

2. As will be shown subsequently in this section, the conditions in the PC_ρ conjecture require that a PC_ρ instance be (log n)^{1+Ω(1)}-wise almost independent for it to be conjectured to be hard.

3. While PCρ is not of the form Tδ µ, its generalization PDSρ at any pair of constant edge densities
0 < q < p < 1 always is. All of our reductions also apply to input instances of PDSρ and thus a PDSρ
variant of the PCρ conjecture is sufficient to deduce our computational lower bounds. That said, we
do not expect that the computational complexity of PCρ and PDSρ to be different as long as p and q
are constant.
As mentioned in Section 2, while we restrict our formal statement of the PCρ conjecture to the specific
hardness assumptions we need for our reductions, we believe it should hold generally for ρ with sufficient

symmetry. A candidate condition is that ρ is invariant to a subgroup H ⊆ Sn of permutations such that, for
each index i ∈ [n], there are at least n^{Ω(n)} permutations π ∈ H with π(i) ≠ i. This ensures that ρ has a
large number of nontrivial symmetries that are not just permuting coordinates known not to lie in the clique.
We also remark that there are many examples of hypothesis testing problems where the three conditions
in [Hop18] are violated but low-degree polynomials still seem to accurately predict the performance of the
best known efficient algorithms. As mentioned in [HW20], the spiked Wishart model does not quite satisfy S_n-invariance, but low-degree predictions are still conjecturally accurate. Ordinary PC is not of the form T_δ µ
and the low-degree conjecture accurately predicts the PC conjecture, which is widely believed to be true.

The Degree Requirement and a Stronger PCρ Conjecture. Furthermore, the degree requirement for the
almost independence condition of Conjecture 12.5 is often not exactly necessary. It is discussed in Section
4.2.5 of [KWB19] that, for sufficiently nice distributions H0 and H1 , low-degree predictions are often still
accurate when the almost independence condition is relaxed to only be ω(1)-wise for any ω(1) function of
n. This yields the following stronger variant of the PCρ conjecture.

Conjecture 12.6 (Informal – Stronger PCρ Conjecture). For sufficiently symmetric ρ, there is no polynomial
time algorithm solving PCρ (n, k, 1/2) if there is some function w(n) = ωn (1) such that the tail bounds on
pρ (s) in Conjecture 2.2 are only guaranteed to hold for all d ≤ w(n).

We conjecture that the ρ in Conjecture 2.3 are symmetric enough for this conjecture to hold. A nearly
identical argument to that in Theorem 12.2 can be used to show that this stronger PCρ conjecture implies the
exact boundaries in Conjecture 2.3, without the small polynomial error factors of O(n^ε) and O(m^ε).
We now make several notes on the degree requirement in the PCρ conjecture, as stated in Conjecture
2.2. As will be shown later in this section, the tail bounds on pρ (s) for a particular d directly imply the
d-wise almost independence of PC_ρ. Now note that for any ρ and k ≫ log n, there is always a d-simple statistic solving PC_ρ with d = O((log n)²). Specifically, G(n, 1/2) has its largest clique of size less than (2 + ε) log₂ n with probability 1 − o_n(1) and any instance of H_1 of PC_ρ with k ≫ log n has n^{ω(1)} cliques of size ⌈3 log₂ n⌉. Furthermore, the number of cliques of this size can be expressed as a degree O((log n)²) polynomial in the edge indicators of a graph. Similarly, the largest clique in an s-uniform Erdős-Rényi hypergraph is in general of size O((log n)^{1/(s−1)}) and a simple clique-counting test distinguishing this from the planted clique hypergraph distribution can be expressed as an O((log n)^{s/(s−1)}) degree polynomial. This shows that for all ρ, the problem PC_ρ is not O((log n)²)-wise almost independent. Furthermore, for any δ > 0, there is some ρ corresponding to a hypergraph variant of PC such that PC_ρ is not O((log n)^{1+δ})-wise almost independent. Thus the tail bounds in Conjecture 2.2 never hold for δ ≥ 1 and, for any δ' > 0, there is some ρ requiring δ ≤ δ' for these tail bounds to be true.
Finally, we remark that there are highly asymmetric examples of ρ for which Conjecture 12.6 is not true. Suppose that n is even, let c > 0 be an arbitrarily large integer and let S_1, S_2, . . . , S_{n^c} ⊆ [n/2] be a known family of subsets of size ⌈3 log₂ n⌉. Now let ρ be sampled by taking the union of an S_i chosen uniformly at random and a size k − ⌈3 log₂ n⌉ subset of {n/2 + 1, n/2 + 2, . . . , n} chosen uniformly at random. The resulting PC_ρ problem can be solved in polynomial time by exhaustively searching for the subset S_i. However, this ρ only violates the tail bounds on p_ρ in Conjecture 2.2 for d = Ω_n(log n/ log log n). If S_1, S_2, . . . , S_{n^c} are sufficiently pseudorandom, then the structure of this ρ only appears in the tails of p_ρ(s) when s ≥ ⌈3 log₂ n⌉. In particular, the probability that s ≥ ⌈3 log₂ n⌉ under p_ρ is at least the chance that two independent samples from ρ choose the same S_i, which occurs with probability n^{−c}. It can be verified that the tail bound of p_0 · s^{−2d−4} in Conjecture 2.2 only excludes this possibility when d = Ω_n(log n/ log log n). We remark though that this ρ is highly asymmetric, and any mild symmetry assumption that would effectively cause the number of S_i to be super-polynomial would break this example.

The Low-Degree Conjecture and PCρ . We now will characterize the power of the optimal D-simple
statistics for PCρ . The following proposition establishes an explicit formula for LR≤D in PCρ , which will be
shown in the subsequent theorem to naturally yield the PMF decay condition in the PCρ conjecture.
Proposition 12.7. Let LR^{≤D} be the low-degree likelihood ratio for the hypothesis testing problem PC_ρ(n, k, 1/2) between G(n, 1/2) and G_ρ(n, k, 1/2). For any D ≥ 1, it follows that

||LR^{≤D} − 1||_2² = E_{S,S'∼ρ⊗2}[ # of nonempty edge subsets of S ∩ S' of size at most D ]

Proof. In the notation above, let N = (n choose 2) and identify X ∈ {−1, 1}^N with the space of signed adjacency matrices X of n-vertex graphs. Let P_S be the distribution on graphs in this space induced by PC(n, k, 1/2) conditioned on the clique being planted on the vertices in the subset S, i.e. such that X_{ij} = 1 if i ∈ S and j ∈ S and otherwise X_{ij} = ±1 with probability half each. Now let α ⊆ E_0 be a subset of possible edges. The set of functions {χ_α(X) = ∏_{e∈α} X_e : α ⊆ E_0} comprises the standard Fourier basis on {−1, 1}^{E_0}. For each fixed clique S, because E_{P_S} X_e = 0 if e ∉ (S choose 2) and non-clique edges are independent, we see that

E_{P_S}[χ_α(X)] = 1{V(α) ⊆ S}

We therefore have that

E_{H_1}[χ_α(X)] = E_{S∼ρ} E_{P_S}[χ_α(X)] = E_{S∼ρ}[1{V(α) ⊆ S}] = P_ρ[V(α) ⊆ S]
Now suppose that S' is drawn from ρ independently of S. It now follows that

(E_{H_1}[χ_α(X)])² = (E_{S∼ρ}[1{V(α) ⊆ S}])²
                   = E_{S∼ρ}[1{V(α) ⊆ S}] · E_{S'∼ρ}[1{V(α) ⊆ S'}]
                   = E_{S,S'∼ρ⊗2}[ 1{V(α) ⊆ S} · 1{V(α) ⊆ S'} ]
                   = E_{S,S'∼ρ⊗2}[ 1{V(α) ⊆ S ∩ S'} ]
From Equation (10), we therefore have that

||LR^{≤D} − 1||_2² = ∑_{1≤|α|≤D} (E_{H_1}[χ_α(X)])² = E_{S,S'∼ρ⊗2}[ ∑_{1≤|α|≤D} 1{V(α) ⊆ S ∩ S'} ]

Now observe that the sum

∑_{1≤|α|≤D} 1{V(α) ⊆ S ∩ S'}

counts the number of nonempty edge subsets of S ∩ S' of size at most D, which completes the proof.


This proposition now allows us to show the main result of this section, which is that the condition in
the PCρ conjecture is enough to show the failure of low-degree polynomials for PCρ . Combining the next
theorem with Conjecture 12.3 would suggest that whenever the PMF decay condition of the PC_ρ conjecture holds, there is no polynomial time algorithm solving PC_ρ(n, k, 1/2).
Theorem 12.8 (PC_ρ Implies Failure of Low-Degree). Suppose that ρ satisfies that for any parameter d = O_n(log n), there is some p_0 = o_n(1) such that p_ρ(s) satisfies the tail bounds

p_ρ(s) ≤ p_0 · 2^{−s²} if 1 ≤ s² < d,   and   p_ρ(s) ≤ p_0 · s^{−2d−4} if s² ≥ d

Let LR^{≤D} be the low-degree likelihood ratio for the hypothesis testing problem PC_ρ(n, k, 1/2). Then it also follows that for any parameter D = O_n(log n), we have

||LR^{≤D} − 1||_2 = o_n(1)

Proof. First observe that the number of nonempty edge subsets of S ∩ S' of size at most D can be expressed explicitly as

f_D(s) = ∑_{ℓ=1}^{D} (s(s − 1)/2 choose ℓ)

if s = |S ∩ S'|. Furthermore, we can crudely upper bound f_D in two separate ways. Note that the number of nonempty edge subsets of S ∩ S' is exactly 2^{(s choose 2)} − 1 if s = |S ∩ S'|. Therefore we have that f_D(s) ≤ 2^{(s choose 2)}. Furthermore, using the upper bound (x choose ℓ) ≤ x^ℓ, we have that if s ≥ 3 then

f_D(s) = ∑_{ℓ=1}^{D} (s(s − 1)/2 choose ℓ) ≤ ∑_{ℓ=1}^{D} (s(s − 1)/2)^ℓ ≤ ((s(s − 1)/2)^{D+1} − 1) / (s(s − 1)/2 − 1) ≤ s^{2(D+1)}

Combining these two crude upper bounds, we have that f_D(s) ≤ min{ 2^{(s choose 2)}, s^{2(D+1)} }. Also note that f_D(0) = f_D(1) = 0. Combining this with the given bounds on p_ρ(s), applied with d = D, we have that

||LR^{≤D} − 1||_2² = E_{S,S'∼ρ⊗2}[ f_D(|S ∩ S'|) ]
                   = ∑_{s=2}^{k} p_ρ(s) · f_D(s)
                   ≤ p_0 · ∑_{1≤s²<D} 2^{−s²} · f_D(s) + p_0 · ∑_{D≤s²≤k²} s^{−2D−4} · f_D(s)
                   ≤ p_0 · ∑_{1≤s²<D} 2^{−s²} · 2^{(s choose 2)} + p_0 · ∑_{D≤s²≤k²} s^{−2D−4} · s^{2(D+1)}
                   ≤ p_0 · ∑_{s=1}^{∞} 2^{−(s+1 choose 2)} + p_0 · ∑_{s=1}^{∞} s^{−2} = O_n(p_0)

which completes the proof of the theorem.
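Combining Proposition 12.7 with the explicit overlap PMF from case (1) of Theorem 12.2 gives a formula for ||LR^{≤D} − 1||_2² under ρ_{k-PC} that is easy to evaluate numerically; the following sketch does so (the parameter values in the comment are illustrative).

from math import comb

def f_D(s, D):
    # number of nonempty edge subsets of an s-vertex set of size at most D
    return sum(comb(s * (s - 1) // 2, l) for l in range(1, D + 1))

def low_degree_norm_kpc(n, k, D):
    # ||LR^{<=D} - 1||_2^2 for rho = rho_{k-PC}, using |S cap S'| ~ Bin(k, k/n)
    return sum(comb(k, s) * (k / n) ** s * (1 - k / n) ** (k - s) * f_D(s, D)
               for s in range(2, k + 1))

# e.g. low_degree_norm_kpc(10**6, 100, 20) is small, consistent with hardness
# for k well below sqrt(n), while k closer to sqrt(n) makes the norm blow up.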

12.3 Statistical Query Algorithms and the PCρ Conjecture


In this section, we verify that the lower bounds shown by [FGR+ 13] for PC for a generalization of statistical
query algorithms hold essentially unchanged for SQ variants of k- PC, k- BPC and BPC. We remark at the end
of this section why the statistical query model seems ill-suited to characterizing the computational barriers
in problems that are tensor or hypergraph problems such as k- HPC. Since it was shown in Section 12.1 that
there are specific ρ in PCρ corresponding to k- HPC, it similarly follows that the SQ model seems ill-suited
to characterizing the barriers in PC_ρ for general ρ. Throughout this section, we focus on k-PC, as lower bounds
in the statistical query model for k- BPC and BPC will follow from nearly identical arguments.

Distributional Problems and SQ Dimension. The Statistical Algorithm framework of [FGR+ 13] applies
to distributional problems, where the input is a sequence of i.i.d. observations from a distribution D. In
order to obtain lower bounds in the statistical query model supporting Conjecture 2.3, we need to define
a distributional analogue of k-PC. As in [FGR+ 13], a natural distributional version can be obtained by
considering a bipartite version of k- PC, which we define as follows.

Definition 12.9 (Distributional Formulation of k- PC). Let k divide n and fix a known partition E of [n] into k
parts E1 , E2 , . . . , Ek with |Ei | = n/k. Let S ⊆ [n] be a subset of indices with |S ∩ Ei | = 1 for each i ∈ [k].

The distribution DS over {0, 1}n produces with probability 1−k/n a uniform point X ∼ Unif({0, 1}n ) and
with probability k/n a point X with Xi = 1 for all i ∈ S and XS c ∼ Unif({0, 1})n−k . The distributional
bipartite k-PC problem is to find the subset S given some number of independent samples m from DS .

In other words, the distributional k-PC problem is k-BPC with n left and n right vertices, a randomly-sized right part of the planted biclique and no k-partite structure on the right vertex set. We remark that many of our reductions, such as our reductions to RSME, NEG-SPCA, MSLR and RSLR, only need the k-partite
structure along one vertex set of k- PC or k- BPC. This distributional formulation of k- PC is thus a valid
starting point for these reductions.
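A minimal sampler for the distribution D_S of Definition 12.9, with S given as a set of 0-indexed coordinates (an illustrative convention):

import random

def sample_from_DS(n, S):
    # with prob. 1 - k/n: a uniform point of {0,1}^n;
    # with prob. k/n: ones on S and uniform bits elsewhere
    k = len(S)
    x = [random.randint(0, 1) for _ in range(n)]
    if random.random() < k / n:
        for i in S:
            x[i] = 1
    return x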
We now formally introduce the Statistical Algorithm framework of [FGR+ 13] and SQ dimension. Let
X = {0, 1}n denote the space of configurations and let D be a set of distributions over X . Let F be a set
of solutions and Z : D → 2F be a map taking each distribution D ∈ D to a subset of solutions Z(D) ⊆ F
that are defined to be valid solutions for D. In our setting, F corresponds to clique positions S respecting
the partition E. Furthermore, since each clique position is in one-to-one correspondence with distributions,
there is a single clique Z(D) corresponding to each distribution D. For m > 0, the distributional search
problem Z over D and F using m samples is to find a valid solution f ∈ Z(D) given access to m random
samples from an unknown D ∈ D.
Classes of algorithms in the framework of [FGR+ 13] are defined in terms of access to oracles. The most
basic oracle is an unbiased oracle, which evaluates a simple function on a single sample as follows.

Definition 12.10 (Unbiased Oracle). Let D be the true unknown distribution. A query to the oracle consists
of any function h : X → {0, 1}, and the oracle then takes an independent random sample X ∼ D and
returns h(X).

Algorithms with access to an unbiased oracle are referred to as unbiased statistical algorithms. Since
these algorithms access the sampled data only through the oracle, it is possible to prove unconditional lower
bounds using information-theoretic methods. Another oracle is VSTAT, defined below, which is similar
but also allowed to make an adversarial perturbation of the function evaluation. It is shown in [FGR+ 13]
via a simulation argument that the two oracles are approximately equivalent.

Definition 12.11 (VSTAT Oracle). Let D be the true distribution and t > 0 a sample size parameter. A query to the VSTAT(t) oracle again consists of any function h : X → [0, 1], and the oracle returns an arbitrary value v ∈ [E_D h(X) − τ, E_D h(X) + τ], where τ = max{ 1/t, √(E_D h(X)(1 − E_D h(X))/t) }.

We borrow some definitions from [FGR+13]. Given a distribution D, we define the inner product ⟨f, g⟩_D = E_{X∼D} f(X)g(X) and the corresponding norm ||f||_D = √⟨f, f⟩_D. Given two distributions D_1 and D_2, both absolutely continuous with respect to D, their pairwise correlation is defined to be

χ_D(D_1, D_2) = |⟨ D_1/D − 1, D_2/D − 1 ⟩_D| = |⟨D̂_1, D̂_2⟩_D|,

where D̂_1 = D_1/D − 1. The average correlation ρ(D, D) of a set of distributions D relative to a distribution D is then given by

ρ(D, D) = (1/|D|²) ∑_{D_1,D_2 ∈ D} χ_D(D_1, D_2) = (1/|D|²) ∑_{D_1,D_2 ∈ D} |⟨ D_1/D − 1, D_2/D − 1 ⟩_D|.

Given these definitions, we can now introduce the key quantity from [FGR+ 13], statistical dimension, which
is defined in terms of average correlation.

Definition 12.12 (Statistical dimension). Fix γ > 0, η > 0, and search problem Z over set of solutions F
and class of distributions D over X . We consider pairs (D, DD ) consisting of a “reference distribution” D
over X and a finite set of distributions DD ⊆ D with the following property: for any solution f ∈ F, the
set Df = DD \ Z −1 (f ) has size at least (1 − η) · |DD |. Let `(D, DD ) be the largest integer ` so that for
any subset D0 ⊆ Df with |D0 | ≥ |Df |/`, the average correlation is |ρ(D0 , D)| < γ (if there is no such ` one
can take ` = 0). The statistical dimension with average correlation γ and solution set bound η is defined to
be the largest `(D, DD ) for valid pairs (D, DD ) as described, and is denoted by SDA(Z, γ, η).

In [FGR+ 13], it is shown that statistical dimension immediately yields a lower bound on the number of
queries to an unbiased oracle or a VSTAT oracle needed to solve a given distributional search problem.

Theorem 12.13 (Theorems 2.7 and 3.17 of [FGR+ 13]). Let X be a domain and Z a search problem over a
set of solutions F and a class of distributions D over X . For γ > 0 and η ∈ (0, 1), let ` = SDA(Z, γ, η).
Any (possibly randomized) statistical query algorithm that solves Z with probability δ > η requires at least ℓ calls to the VSTAT(1/(3γ)) oracle to solve Z.
Moreover, any statistical query algorithm requires at least m calls to the Unbiased Oracle for m = min{ ℓ(δ − η)/(2(1 − η)), (δ − η)²/(12γ) }. In particular, if η ≤ 1/6, then any algorithm with success probability at least 2/3 requires at least min{ℓ/4, 1/(48γ)} samples from the Unbiased Oracle.

We remark that the number of queries to an oracle is a lower bound on the runtime of the statistical
algorithm in question. Furthermore, the number of "samples" m corresponding to a VSTAT(t) oracle is t, as this is the number needed to approximately obtain the confidence interval of width 2τ in the definition of the VSTAT oracle above.

SQ Lower Bounds for Distributional k- PC. We now will use the theorem above to deduce SQ lower
bounds for distributional k- PC. Let S be the set of all k-subsets of [n] respecting the partition E i.e. S =
{S : |S| = k and |S ∩ Ei | = 1 for i ∈ [k]}. Note that |S| = (n/k)k . We henceforth use D to denote the
uniform distribution on {0, 1}n . The following lemma is as in [FGR+ 13], except that we further restrict S
and T to be in S rather than arbitrary size k subsets of [n], which does not change the bound.

Lemma 12.14 (Lemma 5.1 in [FGR+13]). For S, T ∈ S, χ_D(D_S, D_T) = |⟨D̂_S, D̂_T⟩_D| ≤ 2^{|S∩T|} k²/n².
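For small n, this bound can be checked directly by brute force over {0, 1}^n; in the sketch below the planted distributions are as in Definition 12.9 and the parameter values are illustrative.

from itertools import product

def chi_D(n, k, S, T):
    # exact pairwise correlation chi_D(D_S, D_T) for Definition 12.9 (small n only)
    def density(x, C):
        planted = all(x[i] == 1 for i in C)
        return (1 - k / n) * 2 ** (-n) + (k / n) * planted * 2 ** (-(n - k))
    ip = sum(2 ** (-n) * (density(x, S) * 2 ** n - 1) * (density(x, T) * 2 ** n - 1)
             for x in product((0, 1), repeat=n))
    return abs(ip)

# e.g. chi_D(10, 2, {0, 1}, {0, 2}) returns (k^2/n^2) * (2^{|S & T|} - 1) = 0.04,
# within the 2^{|S cap T|} k^2 / n^2 bound of the lemma.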

The following lemma is crucial to deriving the SQ dimension of distributional k- PC and is similar to
Lemma 5.2 in [FGR+ 13]. Its proof is deferred to Appendix B.1.

Lemma 12.15 (Modification of Lemma 5.2 in [FGR+13]). Let δ ≥ 1/log n and k ≤ n^{1/2−δ}. For any integer ℓ ≤ k, S ∈ S, and set A ⊆ S with |A| ≥ 2|S|/n^{2ℓδ},

(1/|A|) ∑_{T∈A} ⟨D̂_S, D̂_T⟩_D ≤ 2^{ℓ+3} k²/n².

This lemma now implies the following SQ dimension lower bound for distributional k- PC.

Theorem 12.16 (Analogue of Theorem 5.3 of [FGR+13]). For δ ≥ 1/log n and k ≤ n^{1/2−δ}, let Z denote the distributional bipartite k-PC problem. If ℓ ≤ k then SDA(Z, 2^{ℓ+3} k²/n², (n/k)^{−k}) ≥ n^{2ℓδ}/8.

Proof. For each clique position S let D_S = D \ {D_S}. Then |D_S| = (n/k)^k − 1 = (1 − (n/k)^{−k})|D|. Now for any D' with |D'| ≥ 2|S|/n^{2ℓδ} we can apply Lemma 12.15 to conclude that ρ(D', D) ≤ 2^{ℓ+3} k²/n². By Definition 12.12 of statistical dimension this implies the bound stated in the theorem.
Definition 12.12 of statistical dimension this implies the bound stated in the theorem.

Applying Theorem 12.13 to this statistical dimension lower bound yields the following hardness for
statistical query algorithms.

Corollary 12.17 (SQ Lower Bound for Recovery in Distributional k-PC). For any constant δ > 0 and k ≤ n^{1/2−δ}, any SQ algorithm that solves the distributional bipartite k-PC problem requires Ω(n²/(k² log n)) = Ω̃(n^{1+2δ}) queries to the Unbiased Oracle.

This is to be interpreted as impossible, as there are only n right vertices available in the actual bipartite graph. Because all the quantities in Theorem 12.16 are the same as in [FGR+13] up to constants, the same logic as used there allows us to deduce a statement regarding the hypothesis testing version, stated there as Theorems 2.9 and 2.10.

Corollary 12.18 (SQ Lower Bound for Decision Variant of Distributional k-PC). For any constant δ > 0,
suppose k ≤ n1/2−δ . Let D = Unif({0, 1}n ) and let D be the set of all planted bipartite k-PC distributions
(one for each clique position). Any SQ algorithm that solves the hypothesis testing problem between D and
D with probability better than 2/3 requires Ω(n2 /k 2 ) queries to the Unbiased Oracle.
A similar statement holds for VSTAT. There is a t = nΩ(log n) such that any randomized SQ algorithm
that solves the hypothesis testing problem between D and D with probability better than 2/3 requires at
least t queries to VSTAT(n^{2−δ}/k²).

We conclude this section by outlining how to extend these lower bounds to distributional versions of
k- BPC and BPC and why the statistical query model is not suitable to deduce hardness of problems that are
implicitly tensor or hypergraph problems such as k- HPC.

Extending these SQ Lower Bounds. Extending to the bipartite case is straightforward and follows by replacing the probability of including each right vertex from k/n to k_m/m where k_m = O(m^{1/2−δ}). This causes the upper bound in Lemma 12.14 to become χ_D(D_S, D_T) = |⟨D̂_S, D̂_T⟩_D| ≤ 2^{|S∩T|} k_m²/m². Similarly, the upper bound in Lemma 12.15 becomes 2^{ℓ+3} k_m²/m², the relevant statistical dimension bound becomes SDA(Z, 2^{ℓ+3} k_m²/m², (n/k)^{−k}) ≥ n^{2ℓδ}/8 and the query lower bound in the final corollary becomes Ω(m²/(k_m² log n)) = Ω̃(m^{1+2δ}), which yields the desired lower bound for k-BPDS. The lower bound for BPDS follows by the same extension to the ordinary PC lower bound in [FGR+13].

Hypergraph PC and SQ Lower Bounds. A key component of formulating SQ lower bounds is devising
a distributional version of the problem with analogous limits in the SQ model. While there was a natural
bipartite extension for PC, for hypergraph PC, such an extension does not seem to exist. Treating slices as
individual samples yields a problem that admits statistical query algorithms detecting the planted clique with few queries, even though each query cannot be evaluated in polynomial time. Consider the function that, given a slice, searches for a clique of size k in the induced (s − 1)-uniform hypergraph on the neighbors of the vertex corresponding to the slice, outputting 1 if such a clique is found. Without a planted clique, the probability that a slice contains such a clique is exponentially small, while it is k/n if there is a planted clique. An alternative is to consider individual entries as samples,
but this discards the hypergraph structure of the problem entirely.

13 Robustness, Negative Sparse PCA and Supervised Problems


In this section, we apply reductions in Part II to deduce computational lower bounds for robust sparse mean
estimation, negative sparse PCA, mixtures of SLRs and robust SLR that follow from specific instantiations
of the PCρ conjecture. Specifically, we apply the reduction k- BPDS - TO - ISGM to deduce a lower bound
for RSME, the reduction BPDS - TO - NEG - SPCA to deduce a lower bound for NEG - SPCA and the reduction

k- BPDS - TO - MSLR to deduce lower bounds for MSLR, USLR and RSLR. This section is primarily devoted
to summarizing the implications of these reductions and making explicit how their input parameters need
to be set to deduce our lower bounds. The implications of these lower bounds and the relation between
them and algorithms was previously discussed in Section 3. In cases where the discussion in Section 3
was not exhaustive, such as the details of starting with different hardness assumptions, the number theoretic
condition ( T ) or the adversary implied by our reductions for RSLR, we include omitted details in this section.
All lower bounds that will be shown in this section are computational lower bounds in the sense intro-
duced in the beginning of Section 3. To deduce our computational lower bounds from reductions, it suffices
to verify the three criteria in Condition 6.1. We remark that this section is technical due to the number-
theoretic constraints imposed by the prime number r in our reductions. However, these technical details are
tangential to the primary focus of the paper, which is reduction techniques.

13.1 Robust Sparse Mean Estimation


We first observe that the instances of ISGM output by the reduction k-BPDS-TO-ISGM are instances of RSME in Huber's contamination model. Let r be a prime number and ε ≥ 1/r. It then follows that a sample from ISGM_D(n, k, d, µ, 1/r) is of the form

MIX_ε( N(µ · 1_S, I_d), D_O )^{⊗n}   where   D_O = MIX_{ε^{−1}r^{−1}}( N(µ · 1_S, I_d), N(µ_0 · 1_S, I_d) )

for some possibly random S with |S| = k and where (1 − r^{−1})µ + r^{−1} · µ_0 = 0. Note that this is a distribution in the composite hypothesis H_1 of RSME(n, k, d, τ, ε) in Huber's contamination model with outlier distribution D_O and where τ = ||µ · 1_S||_2 = µ√k. This observation and the discussion in Section 6.2 yield that it suffices to exhibit a reduction to ISGM to show the lower bound for RSME in Theorem 3.1.
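As a sanity check on this decomposition, the following NumPy sketch draws from the two-component ISGM mixture directly and via the Huber form MIX_ε(N(µ · 1_S, I_d), D_O); the two samplers produce the same distribution. The conventions used (0-indexed S, µ_0 = −(r − 1)µ, the mixture-weight convention for MIX) are illustrative assumptions.

import numpy as np

def sample_isgm_direct(n, d, S, mu, r, rng):
    # (1 - 1/r) N(mu 1_S, I_d) + (1/r) N(mu0 1_S, I_d) with mu0 = -(r - 1) mu
    mu0 = -(r - 1) * mu
    means = np.where(rng.random(n) < 1 / r, mu0, mu)
    X = rng.standard_normal((n, d))
    X[:, S] += means[:, None]
    return X

def sample_isgm_huber(n, d, S, mu, r, eps, rng):
    # the same law written as MIX_eps(N(mu 1_S, I_d), D_O) with
    # D_O = MIX_{1/(eps r)}(N(mu 1_S, I_d), N(mu0 1_S, I_d)); needs eps >= 1/r
    mu0 = -(r - 1) * mu
    use_mu0 = (rng.random(n) < eps) & (rng.random(n) < 1 / (eps * r))
    means = np.where(use_mu0, mu0, mu)
    X = rng.standard_normal((n, d))
    X[:, S] += means[:, None]
    return X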
We now discuss the condition (T) and the number-theoretic constraint arising from applying Theorem 10.2 to prove Theorem 3.1. As mentioned in Section 3.1, while this condition does not restrict our computational lower bound for RSME in the main regime of interest where ε^{−1} = n^{o(1)}, it also can be removed using the design matrices R_{n,ε} in place of K_{r,t}. Despite this, we introduce the condition (T) in this section as it will be a necessary condition in subsequent lower bounds in Part III.
As discussed in Section 10, the prime power r^t in k-BPDS-TO-ISGM is intended to be a fairly close approximation to each of k_n, √n and √N. We will now see that in order to show tight computational lower bounds for RSME, this approximation needs to be very close to asymptotically exact, leading to the technical condition (T). First note that the level of signal µ produced by the reduction k-BPDS-TO-ISGM is

µ ≤ δ / ( 2√(6 log(k_n m r^t) + 2 log (p − q)^{−1}) ) · 1/√( r^t(r − 1)(1 + (r − 1)^{−1}) ) = Θ̃( r^{−(t+1)/2} )

where δ = Θ(1) and the estimate above holds whenever p and q are constants. Therefore the corresponding τ is given by τ = µ√k = Õ(k^{1/2} r^{−(t+1)/2}). Furthermore, in Theorem 10.2, the output number of samples N is constrained to satisfy N = o(k_n r^t) and n = O(k_n r^t). Combining this with the fact that in order to be starting with a hard k-BPDS instance, we need k_n = o(√n) to hold, it is straightforward to see that these constraints together require that N = o(r^{2t}). If this is close to tight with N = Θ̃(r^{2t}), the computational lower bound condition on τ becomes

τ = Õ( k^{1/2} r^{−(t+1)/2} ) = Θ̃( k^{1/2} ε^{1/2} N^{−1/4} )

where we also use the fact that ε = Θ(1/r). Note that this corresponds exactly to the desired computational lower bound of N = õ(k²ε²/τ⁴). Furthermore, if instead N = Θ̃(a^{−1} r^{2t}) for some a = ω(1), then the lower bound we show degrades to N = õ(k²ε²/aτ⁴), and is suboptimal by a factor of a = ω(1). Thus ideally we would like the pair of parameters (N, r) to be such that there are infinitely many N with something like N = Θ̃(r^{2t}) true for some positive integer t ∈ ℕ. This leads exactly to the condition (T) below.
N = Θ̃(r2t ) true for some positive integer t ∈ N. This leads exactly to the condition ( T ) below.

Definition 13.1 (Condition ( T )). Suppose that (N, r) is a pair of parameters with N ∈ N and r = r(N )
is non-decreasing. The pair (N, r) satisfies ( T ) if either r = N o(1) as N → ∞ or if r = Θ̃(N 1/t ) where
t ∈ N is a constant even integer.
The key property arising from condition ( T ) is captured in the following lemma.
Lemma 13.2 (Property of (T)). Suppose that (N, r) satisfies (T) and let r' = r'(N) be any non-decreasing positive integer parameter satisfying that r' = Θ̃(r). Then there are infinitely many values of N with the following property: there exists s ∈ ℕ such that √N = Θ̃((r')^s).
Proof. If r = Θ̃(N^{1/t}) where t ∈ ℕ is a constant even integer, then this property is satisfied trivially by taking s = t/2. Now suppose that r = N^{o(1)} and note that this also implies that r' = N^{o(1)}. Now consider the function

f(N) = log N / (2 log r'(N))

Since r' = N^{o(1)}, it follows that f(N) → ∞ as N → ∞. Suppose that N is sufficiently large so that f(N) > 1. Note that, for each N, either r'(N + 1) ≥ r'(N) + 1 or r'(N + 1) = r'(N). If r'(N + 1) = r'(N), then f(N + 1) > f(N). If r'(N + 1) ≥ r'(N) + 1, then

f(N + 1)/f(N) ≤ g(N)/g(r'(N))   where   g(x) = log(x + 1)/log x

Note that g(x) is a decreasing function of x for x ≥ 2. Since f(N) > 1, it follows that r'(N) < N and hence the above inequality implies that f(N + 1) < f(N). Summarizing these observations, every time f(N) increases it must follow that r'(N + 1) = r'(N). Fix a sufficiently large positive integer s and consider the first N for which f(N) ≥ s. It follows by our observation that r'(N) = r'(N − 1) and furthermore that f(N − 1) < s. This implies that N − 1 < r'(N)^{2s} and N ≥ r'(N)^{2s}. Since r'(N) is a positive integer, it then must follow that N = r'(N)^{2s}. Since such an N exists for every sufficiently large s, this completes the proof of the lemma.
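The construction in this proof is easy to test numerically: scanning upward for the first N with f(N) ≥ s lands exactly on a perfect power of r'(N). The sketch below does this for a hypothetical slowly-growing choice of r'.

import math

def first_N_with_power_property(r_prime, s):
    # first N with f(N) = log N / (2 log r'(N)) >= s, i.e. with N >= r'(N)^{2s}
    N = 3
    while N < r_prime(N) ** (2 * s):
        N += 1
    return N, r_prime(N) ** (2 * s)

# e.g. with r_prime = lambda N: int(math.log(N)) + 2 and s = 2, both returned
# values equal 14641 = 11^4, as the argument in the proof predicts.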

This condition (T) will arise in a number of other problems that we map to, including robust SLR and
dense stochastic block models, for a nearly identical reason. We now formally prove Theorem 3.1. All
remaining proofs in this section will be of a similar flavor and where details are similar, we only sketch them
to avoid redundancy.

Theorem 3.1 (Lower Bounds for RSME). If k, d and n are polynomial in each other, k = o(√d) and ε < 1/2 is such that (n, ε^{−1}) satisfies (T), then the k-BPC conjecture or k-BPDS conjecture for constant 0 < q < p ≤ 1 both imply that there is a computational lower bound for RSME(n, k, d, τ, ε) at all sample complexities n = õ(k²ε²/τ⁴).
Proof. To prove this theorem, we will show that Theorem 10.2 implies that k-BPDS-TO-ISGM fills out all of the possible growth rates specified by the computational lower bound n = õ(k²ε²/τ⁴) and the other conditions in the theorem statement. As discussed earlier in this section, it suffices to reduce in total variation to ISGM(n, k, d, µ, 1/r) where 1/r ≤ ε and µ = τ/√k.
Fix a constant pair of probabilities 0 < q < p ≤ 1 and any sequence of parameters (n, k, d, τ, ε), all of which are implicitly functions of n, such that (n, ε^{−1}) satisfies (T) and (n, k, d, τ, ε) satisfy the conditions

n ≤ c · k²ε² / (τ⁴ · (log n)^{2+2c_0})   and   wk² ≤ d

for sufficiently large n, an arbitrarily slow-growing function w = w(n) → ∞ at least satisfying that w(n) = n^{o(1)}, a sufficiently small constant c > 0 and a sufficiently large constant c_0 > 0. In order to fulfill the criteria in Condition 6.1, we now will specify:

1. a sequence of parameters (M, N, kM , kN , p, q) such that the k- BPDS instance with these parameters
is hard according to Conjecture 2.3; and

2. a sequence of parameters (n', k, d, τ, ε) with a subsequence that satisfies three conditions: (2.1) the parameters on the subsequence are in the regime of the desired computational lower bound for RSME; (2.2) they have the same growth rate as (n, k, d, τ, ε) on this subsequence; and (2.3) such that RSME with the parameters on this subsequence can be produced by the reduction k-BPDS-TO-ISGM with input k-BPDS(M, N, k_M, k_N, p, q).

By the discussion in Section 6.2, this would be sufficient to show the desired computational lower bound.
We choose these parameters as follows:

• let r be a prime with r ≥ ε^{−1} and r ≤ 2ε^{−1}, which exists by Bertrand's postulate and can be found in poly(ε^{−1}) ≤ poly(n) time;

• let t be such that r^t is the closest power of r to √n, let n' = ⌊w^{−2} r^{2t}⌋, let k_N = ⌊√n'⌋ and let N = wk_N² ≤ k_N r^t; and

• set µ = τ/√k, k_M = k and M = wk² (a numerical sketch of these choices is given below).
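The following sketch spells out these parameter choices; the brute-force primality test stands in for the poly(ε^{−1})-time search guaranteed by Bertrand's postulate, and the rounding conventions are illustrative.

import math

def is_prime(m):
    return m > 1 and all(m % d for d in range(2, math.isqrt(m) + 1))

def choose_parameters(n, k, eps, w):
    r = math.ceil(1 / eps)
    while not is_prime(r):
        r += 1                                    # lands in roughly [1/eps, 2/eps]
    t = max(1, round(math.log(math.sqrt(n), r)))  # r^t approximately closest to sqrt(n)
    n_prime = int(w ** (-2) * r ** (2 * t))
    k_N = math.isqrt(n_prime)
    return r, t, n_prime, k_N, int(w * k_N ** 2), k, int(w * k ** 2)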

The given inequality and parameter settings above rearrange to the following condition on n':

n' ≤ w^{−2} r^{2t} = O( (k²ε² / (τ⁴ · (log n)^{2+2c_0})) · (r^{2t}/n) )

Furthermore, the given inequality yields the constraint on µ that

µ = τ · k^{−1/2} ≤ c^{1/4} ε^{1/2} / ( n^{1/4} (log n)^{(1+c_0)/2} ) = Θ( (r^{t/2}/n^{1/4}) · 1/√( r^{t+1}(log n)^{1+c_0} ) )


As long as √n = Θ̃(r^t), then: (2.1) the inequality above on n' would imply that (n', k, d, τ, ε) is in the desired hard regime; (2.2) n and n' have the same growth rate since w = n^{o(1)}; and (2.3) taking c_0 large enough would imply that µ satisfies the conditions needed to apply Theorem 10.2 to yield the desired reduction. By Lemma 13.2, there is an infinite subsequence of the input parameters such that √n = Θ̃(r^t). This verifies the three criteria in Condition 6.1. Following the argument in Section 6.2, Lemma 6.1 now implies the theorem.

As alluded to in Section 3.1, replacing K_{r,t} with R_{n,ε} in the applications of dense Bernoulli rotations in k-BPDS-TO-ISGM removes condition (T) from this lower bound. Specifically, applying k-BPDS-TO-ISGMR and Corollary 10.5 in place of k-BPDS-TO-ISGM and replacing the dimension r^t with L in the argument above yields the lower bound shown below. Note that condition (T) in Theorem 3.1 is replaced by the looser requirement that ε = Ω̃(n^{−1/2}). As discussed at the end of Section 10.1, this requirement arises from the condition ε ≫ L^{−1} log L in Corollary 10.5. We remark that the condition ε = Ω̃(n^{−1/2}) is implicit in (T) and hence the following corollary is strictly stronger than Theorem 3.1.

Corollary 13.3 (Lower Bounds for RSME without Condition (T)). If k, d and n are polynomial in each other, k = o(√d) and ε < 1/2 is such that ε = Ω̃(n^{−1/2}), then the k-BPC conjecture or k-BPDS conjecture for constant 0 < q < p ≤ 1 both imply that there is a computational lower bound for RSME(n, k, d, τ, ε) at all sample complexities n = õ(k²ε²/τ⁴).

102
We remark that only assuming the k- PC conjecture also yields hardness for RSME. In particular k- PC
can be mapped to the asymmetric bipartite case by considering the bipartite subgraph with k/2 parts on
one size and k/2 on the other. Showing hardness for RSME from k- PC then reduces to the hardness yielded
by k- BPC with M = N . Examining this restricted setting in the theorem above and passing through an
analogous argument yields a computational lower bound at the slightly suboptimal rate

n = õ k 2 /τ 2 as long as τ 2 log n = o()




When (log n)−O(1) .  . 1/ log n, then the optimal k-to-k 2 gap is recovered up to polylog(n) factors by
this result.

13.2 Negative Sparse PCA


In this section, we deduce Theorem 3.5 on the hardness of NEG - SPCA using the reduction BPDS - TO - NEG - SPCA
and Theorem 9.5. Because this reduction does not bear the number-theoretic considerations of the reduction
to RSME, this proof will be substantially more straightforward.

Theorem 3.5 (Lower Bounds for NEG - SPCA). If k, d and n are polynomial in each other, k = o( d) and
k = o(n1/6 ), then the BPC or BPDS conjecture for constant 0 < q < p ≤ 1 both imply p conjecture implies a
computational lower bound for NEG - SPCA(n, k, d, θ) at all levels of signal θ = õ( k 2 /n).

Proof. We show that Theorem 9.5 implies that BPDS - TOp - NEG - SPCA fills out all of the possible growth
rates specified by the computational lower bound θ = õ( k 2 /n) and the other conditions in the theorem
statement. Fix a constant pair of probabilities 0 < q < p ≤ 1 and a sequence of parameters (n, k, d, θ) all
of which are implicitly functions of n such that
s
k2
θ ≤ cw−1 · , wk ≤ n1/6 and wk 2 ≤ d
n(log n)2

for sufficiently large n, an arbitrarily slow-growing function w = w(n) → ∞ where w(n) = no(1) and
a sufficiently small constant c > 0. In order to fulfill the criteria in Condition 6.1, we now will specify:
a sequence of parameters (M, N, kM , kN , p, q) such that the BPDS instance with these parameters is hard
according to Conjecture 2.3, and such that NEG - SPCA with the parameters (n, k, d, θ) can be produced by
the reduction BPDS - TO - NEG - SPCA applied to BPDS(M, N, kM , kN , p, q). These parameters along with the
internal parameter τ of the reduction can be chosen as follows:

• let N = n, kN = w−1 n, kM = k and M = wk 2 ; and

• let τ > 0 be such that


4nθ
τ2 =
kN k(1 − θ)

It is straightforward to verify that the inequality above upper bounding θ implies that τ ≤ 4c/ log n and
thus satisfies the condition on τ needed to apply Lemma 9.1 and Theorem 9.5 for a sufficiently small c > 0.
Furthermore, this setting of τ yields
τ 2 kN k
θ=
4n + τ 2 kN k
Furthermore, note that d ≥ M and n  M 3 by construction. Applying Theorem 9.5 now verifies the
desired property above. This verifies the criteria in Condition 6.1 and, following the argument in Section
6.2, Lemma 6.1 now implies the theorem.

103
We remark the the constraint k = o(n1/6 ), as mentioned in Section 3.5, is a technical condition that
we believe should not be necessary for the theorem to hold. This is similar to the constraint arising in the
strong reduction to sparse PCA given by C LIQUE - TO -W ISHART in [BB19b]. In C LIQUE - TO -W ISHART,
the random matrix comparison between Wishart and GOE produced the technical condition that k = o(n1/6 )
in a similar manner to how our comparison result between Wishart and inverse Wishart produces the same
constraint here. We also remark that the reduction C LIQUE - TO -W ISHART can be used here to yield the
same hardness for NEG - SPCA as in Theorem 3.5 based only on the PC conjecture. This is achieved by the
reduction that maps from PC to sparse PCA with d = wk 2 as a first step using C LIQUE - TO -W ISHART and
then uses the second step of BPDS - TO - NEG - SPCA to map to NEG - SPCA.

13.3 Mixtures of Sparse Linear Regressions and Robustness


In this section, we deduce Theorems 13.4, 3.6 and 3.7 on the hardness of unsigned, mixtures of and ro-
bust sparse linear regression, all using the reduction k- BPDS - TO - MSLR with different parameters (r, ) and
Theorem 10.8. We begin by showing bounds for USLR(n, k, d, τ ).
We first make the following simple but important observation. Note that a single sample from USLR is of
the form y = |τ ·hvS , Xi+N (0, 1)|, which has the same distribution as |y 0 | where y 0 = τ r·hvS , Xi+N (0, 1)
and r is an independent Rademacher random variable. Note that y 0 is a sample from MSLRD (n, k, d, γ, 1/2)
with γ = τ . Thus to show a computational lower bound for USLR(n, k, d, τ ), it suffices to show a lower
bound for MSLR(n, k, d, τ ).

Theorem 13.4 (Lower Bounds for USLR). If k, d and n are polynomial in each other, k = o( d) and
k = o(n1/6 ), then the k- BPC or k- BPDS conjecture for constant 0 < q < p ≤ 1 both imply that there is a
computational lower bound for USLR(n, k, d, τ ) at all sample complexities n = õ(k 2 /τ 4 ).

Proof. To prove this theorem, we will show that Theorem 10.8 implies that k- BPDS - TO - MSLR applied with
r = 2 fills out all of the possible growth rates specified by the computational lower bound n = õ(k 2 /τ 4 ) and
the other conditions in the theorem statement. As mentioned above, it suffices to reduce in total variation
to MSLR(n, k, d, τ ). Fix a constant pair of probabilities 0 < q < p ≤ 1 and any sequence of parameters
(n, k, d, τ ) all of which are implicitly functions of n with

k2
n≤c· , wk ≤ n1/6 and wk 2 ≤ d
w2 · τ 4 · (log n)4

for sufficiently large n, an arbitrarily slow-growing function w = w(n) → ∞ and a sufficiently small
constant c > 0. In order to fulfill the criteria in Condition 6.1, we now will specify: a sequence of pa-
rameters (M, N, kM , kN , p, q) such that the k- BPDS instance with these parameters is hard according to
Conjecture 2.3, and such that MSLR with the parameters (n, k, d, τ, 1/2) can be produced by the reduction
k- BPDS - TO - MSLR applied with r = 2 to BPDS(M, N, kM , kN , p, q). By the discussion in Section 6.2, this
would be sufficient to show the desired computational lower bound. We choose these parameters as follows:
√ √
• let t be such that 2t is the smallest power of two greater than w n, let kN = b nc and let N =
wkN 2 ≤ k 2t ; and
N

• set kM = k and M = wk 2 .

Now note that τ 2 is upper bounded by

c1/2 · k
 
2 kN kM
τ ≤ =O
wn1/2 · (log n)2 N log(M N )

104
Furthermore, we have that

c1/2 · k
 
2 kM
τ ≤ =Θ
wn1/2 · (log n)2 2 log(kN M · 2t ) log n
t+1

Therefore τ satisfies the conditions needed to apply Theorem 10.8 for a sufficiently small c > 0. Also
note that n  M 3 and d ≥ M by construction. Applying Theorem 10.8 now verifies the desired property
above. This verifies the criteria in Condition 6.1 and, following the argument in Section 6.2, Lemma 6.1
now implies the theorem.

The proof of the theorem above also directly implies Theorem 3.6. This yields our main computational
lower bounds for MSLR, which are stated below.

Theorem 3.6 (Lower Bounds for MSLR). If k, d and n are polynomial in each other, k = o( d) and
k = o(n1/6 ), then the k- BPC or k- BPDS conjecture for constant 0 < q < p ≤ 1 both imply that there is a
computational lower bound for MSLR(n, k, d, τ ) at all sample complexities n = õ(k 2 /τ 4 ).

Now observe that the instances of MSLR output by the reduction k- BPDS - TO - MSLR applied with r > 2
are instances of RSLR in Huber’s contamination model. Let r be a prime number and  ≥ 1/r. Also
let X ∼ N (0, Id ) and y = τ · hvS , Xi + η where η ∼ N (0, 1) where |S| = k. By Definition 10.7,
MSLR D (n, k, d, τ, 1/r) is of the form

⊗n
where DO = MIX−1 r−1 L(X, y), L0

MIX  (L(X, y), DO )

for some possibly random S with |S| = k and where L0 denotes the distribution on pairs (X, y) that are
jointly Gaussian with mean zero and (d + 1) × (d + 1) covariance matrix
 " 2 −1)γ 2
#
Id + (a1+γ > −aγ · v

ΣXX ΣXy 2 · v v
S S S
=
ΣyX Σyy −aγ · vS> 1 + γ2

This yields a very particular construction of an adversary in Huber’s contamination model, which we show
in the next theorem yields a computational lower bound for RSLR. With the observations above, the proof
of this theorem is similar to that of Theorem 3.1 and is deferred to Appendix B.2.

Theorem 13.5 (Lower Bounds for RSLR with Condition √ (T)). If k, d and n are polynomial in each other,
 < 1/2 is such that (n, −1 ) satisfies ( T ), k = o( d) and k = o(n1/6 ), then the k- BPC conjecture or
k- BPDS conjecture for constant 0 < q < p ≤ 1 both imply that there is a computational lower bound for
RSLR (n, k, d, τ, ) at all sample complexities n = õ(k 2 2 /τ 4 ).

Our main computational lower bound for RSLR follows from the same argument applied to the reduction
k- BPDS - TO - MSLRR instead of k- BPDS - TO - MSLR and using Corollary 10.13 instead of Theorem 10.8. As
in Corollary 13.3, this replaces condition (T) with the weaker condition that  = Ω̃(n−1/2 ).

Theorem 3.7 (Lower Bounds


√ for RSLR). If k, d and n are polynomial in each other,  < 1/2 is such that
 = Ω̃(n−1/2 ), k = o( d) and k = o(n1/6 ), then the k- BPC conjecture or k- BPDS conjecture for constant
0 < q < p ≤ 1 both imply that there is a computational lower bound for RSLR(n, k, d, τ, ) at all sample
complexities n = õ(k 2 2 /τ 4 ).

105
14 Community Recovery and Partition Models
In this section, we devise several reductions based on B ERN -ROTATIONS and T ENSOR -B ERN -ROTATIONS
using the design matrices and tensors from Section 8 to reduce from k- PC, k- PDS, k- BPC and k- BPDS to
dense stochastic block models, hidden partition models and semirandom planted dense subgraph. These
reductions are briefly outlined in Section 4.3.
Furthermore, the heuristic presented at the end of Section 4.3 predicts the computational barriers for the
problems in this section. The `2 norm of the matrix E[X] corresponding to a k- PC instance is Θ(k), which

is just below Θ̃( n) when this k- PC is near its computational barrier. Furthermore, it can be verified that
the `2 norm of the matrices E[X] corresponding to the problems in this section are:
• If γ = P11 − P0 in the ISBM notation of Section 3.2, then a direct calculation yields that the `2 norm
corresponding to ISBM is Θ(nγ/k).

• In GHPM and BHPM, the corresponding `2 norm can be verified to be Θ(Kγ r).
• In our adversarial construction for SEMI - CR, the corresponding `2 norm is Θ(kγ) where γ = P1 − P0 .

Following the heuristic, setting these equal to Θ̃( n) yields the predicted computational barriers of γ 2 =
Θ̃(k 2 /n) in ISBM, γ 2 = Θ̃(n/rK 2 ) in GHPM and BHPM and γ 2 = Θ̃(n/k 2 ) in SEMI - CR. We now present
our reduction to ISBM.

14.1 Dense Stochastic Block Models with Two Communities


We begin by recalling the definition of the imbalanced 2-block stochastic block model from Section 3.2.
Definition 14.1 (Imbalanced 2-Block Stochastic Block Model). Let k and n be positive integers such that k
divides n. The distribution ISBMD (n, k, P11 , P12 , P22 ) over n-vertex graphs G is sampled by first choosing
an (n/k)-subset C ⊆ [n] uniformly at random and sampling the edges of G independently with the following
probabilities 
 P11 if i, j ∈ C
P [{i, j} ∈ E(G)] = P if exactly one of i, j is in C
 12
P22 if i, j ∈ [n]\C
Given a subset C ⊆ [n] of size n/k, we let ISBMD (n, C, P11 , P12 , P22 ) denote ISBM as defined above
conditioned on the latent subset C. As discussed in Section 6.3, this naturally leads to a composite hypoth-
esis testing problem between
H0 : G ∼ G (n, P0 ) and H1 : G ∼ ISBMD (n, k, P11 , P12 , P22 )
where P0 is any edge density in (0, 1). This section is devoted to showing reductions from k- PDS and k- PC
to ISBM formulated as this hypothesis testing problem. In particular, we will focus on P11 , P12 , P22 and P0
all of which are bounded away from 0 and 1 by a constant, and which satisfy that
   
1 1 1 1
P0 = · P11 + 1 − P12 = · P12 + 1 − P22 (11)
k k k k
These two constraints allow P11 , P12 , P22 to be reparameterized in terms of a signal parameter γ as
γ γ
P11 = P0 + γ, P12 = P0 − and P22 = P0 + (12)
k−1 (k − 1)2
There are two main reasons why we restrict to the parameter regime enforced by the density constraints
in (11) – it creates a model with nearly uniform expected degrees and which is a mean-field analogue of
recovering the first community in the k-block stochastic block model.

106
• Nearly Uniform Expected Degrees: Observe that, conditioned on C, the expected degree of a vertex
i ∈ [n] in ISBM(n, k, P11 , P12 , P22 ) is given by
n(k−1)
( n 
k −1 ·P 11 + k · P12 if i ∈ C
E [deg(i)|C] = n n(k−1)
k · P12 + k − 1 · P22 if i ∈ [n]\C

Thus the density constraints in (11) ensure that these differ by at most 1 from each other and from
(n − 1)P0 . Thus all of the vertices in ISBM(n, k, P11 , P12 , P22 ) and the H0 model G (n, P0 ) have
approximately the same expected degree. This precludes simple degree and total edge thresholding
tests that are optimal in models of single community detection that are not degree-corrected. As
discussed in Section 3.4, the planted dense subgraph model has a detection threshold that differs from
the conjectured Kesten-Stigum threshold for recovery of the planted dense subgraph. Thus to obtain
computational lower bounds for a hypothesis testing problem that give tight recovery lower bounds,
calibrating degrees is crucial. The main result of this section can be viewed as showing approximate
degree correction is sufficient to obtain the Kesten-Stigum threshold for ISBM through a reduction
from k- PDS and k- PC.

• Mean-Field Analogue of First Community Recovery in k- SBM: As discussed in Section 3.2, the imbal-
anced 2-block stochastic block model ISBMD (n, k, P11 , P12 , P22 ) is roughly a mean-field analogue of
recovering the first community C1 in a k-block stochastic block model. More precisely, consider a
graph G wherein the vertex set [n] is partitioned into k latent communities C1 , C2 , . . . , Ck each of size
n/k and edges are then included in the graph G independently such that intra-community edges ap-
pear with probability p while inter-community edges appear with probability q < p. The distribution
ISBM D (n, k, P11 , P12 , P22 ) can be viewed as a mean-field analogue of recovering a first community
C = C1 in the k-block model, when
 
1 1
P11 = p, P12 = q and P22 = ·p+ 1− q
k−1 k−1
Here, P22 approximately corresponds to the average edge density on the subgraph of the k-block
model restricted to [n]\C1 . This analogy between ISBM and k- SBM is also why we choose to param-
eterize ISBM in terms of k rather than the size n/k of C.

As discussed in Section 3.2, if k = o( n), the conjectured recovery threshold for efficient recovery in
k- SBM is the Kesten-Stigum threshold of

(p − q)2 k2
&
q(1 − q) n

while the statistically optimal rate of recovery is when this level of signal is instead Ω̃(k 4 /n2 ). Furthermore,
the information-theoretic threshold and conjectured computational barrier are the same for ISBM in the
regime defined by (11). Parameterizing ISBM in terms of γ as in (12), the Kesten-Stigum threshold can be
expressed as γ 2 = Ω̃(k 2 /n). The objective of this section is give a reduction from k- PDS to ISBM in the
dense regime with min{P0 , 1 − P0 } = Ω(1) up to the Kesten-Stigum threshold.
The first reduction of this section k- PDS - TO - ISBM is shown in Figure 12 and maps to the case where
P0 = 1/2 and (12) is only approximately true. In a subsequent corollary, a simple modification of this
reduction will map to all P0 with min{P0 , 1 − P0 } = Ω(1) and show (12) holds exactly. The following
theorem establishes the approximate Markov transition properties of k- PDS - TO - ISBM. TheR proof of this
x 2
theorem follows a similar structure to the proof of Theorem 10.2. Recall that Φ(x) = √12π −∞ e−x /2 dx
denotes the standard normal CDF.

107
Theorem 14.2 (Reduction to ISBM). Let N be a parameter and r = r(N ) ≥ 2 be a prime number. Fix
initial and target parameters as follows:

• Initial k- BPDS Parameters: vertex count N , subgraph size k = o(N ) dividing N , edge probabilities
0 < q < p ≤ 1 with min{q, 1 − q} = Ω(1) and p − q ≥ N −O(1) , and a partition E of [N ].
−1 t
• Target ISBM Parameters: (n, r) where ` = rr−1 and n = kr` for some parameter t = t(N ) ∈ N
satisfying that that
m ≤ krt ≤ kr` ≤ poly(N )
 
p
where m is the smallest multiple of k larger than Q + 1 N and where
p √
Q=1− (1 − p)(1 − q) + 1{p=1} ( q − 1)

• Target ISBM Edge Strengths: (P11 , P12 , P22 ) given by

µ(r − 1)2
   
µ(r − 1)  µ 
P11 = Φ , P12 = Φ − t+1 and P22 = Φ
rt+1 r rt+1

where µ ∈ (0, 1) satisfies that


    
1 p 1−Q
µ≤ p · min log , log
2 6 log(kr`) + 2 log(p − Q)−1 Q 1−p

Let A(G) denote k- PDS - TO - ISBM applied to the graph G with these parameters. Then A runs in poly(N )
time and it follows that
 
k −Ω(N 2 /km) −1
dTV (A (GE (N, k, p, q)) , ISBMD (n, r, P11 , P12 , P22 )) = O √ +e + (kr`)
N
 2

dTV (A (G(N, q)) , G(n, 1/2)) = O e−Ω(N /km) + (kr`)−1

To prove this theorem, we begin by proving a lemma analyzing the dense Bernoulli rotations step of
k- PDS - TO - ISBM. Define vS,F 0 ,F 00 (M ) as in Section 10.1. The proof of the next lemma follows similar steps
to the proof of Lemma 10.3.

Lemma 14.3 (Bernoulli Rotations for ISBM). Let F 0 and F 00 be a fixed partitions of [krt ] and [kr`] into k
parts of size rt and r`, respectively, and let S ⊆ [krt ] where |S ∩ Fi0 | = 1 for each 1 ≤ i ≤ k. Let A3 denote
Step 3 of k- PDS - TO - ISBM with input MPD2 and output MR . Suppose that p, Q and µ are as in Theorem
14.2, then it follows that
 
dTV A3 M[krt ]×[krt ] (S × S, Bern(p), Bern(Q)) ,
 
µ(r − 1) > ⊗kr`×kr`
= O (kr`)−1

L · vS,F ,F (Kr,t )vS,F ,F (Kr,t ) + N (0, 1)
0 00 0 00
r
  t t
 
dTV A3 Bern(Q)⊗kr ×kr , N (0, 1)⊗kr`×kr` = O (kr`)−1


Proof. First consider the case where MPD2 ∼ M[krt ]×[krt ] (S × S, Bern(p), Bern(Q)). Observe that the
submatrices of MPD2 are distributed as follows

(MPD2 )Fi0 ,Fj0 ∼ PB Fi0 × Fj0 , (S ∩ Fi0 , S ∩ Fj0 ), p, Q




108
Algorithm k- PDS - TO - ISBM
Inputs: k- PDS instance G ∈ GN with dense subgraph size k that divides N , and the following parameters

• partition E of [N ] into k parts of size N/k, edge probabilities 0 < q < p ≤ 1


 
p
• let m be the smallest multiple of k larger than Q + 1 N where
p √
Q=1− (1 − p)(1 − q) + 1{p=1} ( q − 1)

rt −1
• output number of vertices n = kr` where r is a prime number r, ` = r−1 for some t ∈ N and

m ≤ krt ≤ kr` ≤ poly(N )

• mean parameter µ ∈ (0, 1) satisfying that


    
1 p 1−Q
µ≤ p · min log , log
2 6 log n + 2 log(p − Q)−1 Q 1−p

1. Symmetrize and Plant Diagonals: Compute MPD1 ∈ {0, 1}m×m with partition F of [m] as

MPD1 ← T O -k-PARTITE -S UBMATRIX(G)

applied with initial dimension N , partition E, edge probabilities p and q and target dimension m.
t t
2. Pad: Form MPD2 ∈ {0, 1}kr ×kr by embedding MPD1 as the upper left principal submatrix of
MPD2 and then adding krt −m new indices for columns and rows, with all missing entries sampled
i.i.d. from Bern(Q). Let Fi0 be Fi with rt −m/k of the new indices. Sample k random permutations
σi of Fi0 independently for each 1 ≤ i ≤ k and permute the indices of the rows and columns of
MPD2 within each part Fi0 according to σi .

3. Bernoulli Rotations: Let F 00 be a partition of [kr`] into k equally sized parts. Now compute the
matrix MR ∈ Rkr`×kr` as follows:

(1) For each i, j ∈ [k], apply T ENSOR -B ERN -ROTATIONS to the matrix (MPD2 )Fi0 ,Fj0 with matrix
parameter A1 = A2 = Kr,t , rejection kernel parameter
p RRK = kr`, Bernoulli probabilities
0 < Q < p ≤ 1, output dimension r`, λ1 = λ2 = 1 + (r − 1)−1 and mean parameter µ.
(2) Set the entries of (MR )Fi00 ,Fj00 to be the entries in order of the matrix output in (1).

4. Threshold and Output: Now construct the graph G0 with vertex set [kr`] such that for each i > j
with i, j ∈ [kr`], we have {i, j} ∈ E(G0 ) if and only if (MR )ij ≥ 0. Output G0 with randomly
permuted vertex labels.

Figure 12: Reduction from k-partite planted dense subgraph to the dense imbalanced 2-block stochastic block model.

109
and are independent. Combining upper bound on the singular values of Kr,t in Lemma 8.5 with Corollary
8.2 implies that
  
µ(r − 1) > ⊗r`×r`
= O r2t · (kr`)−3

dTV (MR )Fi00 ,Fj00 , L · (Kr,t )·,S∩Fi0 (Kr,t )·,S∩F 0 + N (0, 1)
r j

Since the submatrices (MR )Fi00 ,Fj00 are independent, the tensorization property of total variation in Fact 6.2
implies that dTV (MR , L(Z)) = O k 2 r2t · (kr`)−3 = O (kr`)−1 where the submatrices ZFi00 ,Fj00 are
 

independent and satisfy


 
µ(r − 1) > ⊗r`×r`
ZFi00 ,Fj00 ∼ L · (Kr,t )·,S∩Fi0 (Kr,t )·,S∩F 0 + N (0, 1)
r j

Note that the entries of Z are independent Gaussians each with variance 1 and Z has mean given by µ(1 +
r−1 ) · vS,F 0 ,F 00 (Kr,t )vS,F 0 ,F 00 (Kr,t )> , by the definition of vS,F 0 ,F 00 (Kr,t ). This proves the first total variation
t t
upper bound in the statement of the lemma. Now suppose that MPD2 ∼ Bern(Q)⊗kr ×kr . Corollary 8.2
implies that  
dTV (MR )Fi00 ,Fj00 , N (0, 1)⊗r`×r` = O r2t · (kr`)−3


for each 1 ≤ i, j ≤ k. Since the submatrices (MR )Fi00 ,Fj00 of MR are independent, it follows that
 
dTV MR , N (0, 1)⊗kr`×kr` = O k 2 r2t · (kr`)−3 = O (kr`)−1
 

by the tensorization property of total variation in Fact 6.2, completing the proof of the lemma.

The next lemma is immediate but makes explicit the precise guarantees for Step 4 of k- PDS - TO - ISBM.
Lemma 14.4 (Thresholding for ISBM). Let F 0 , F 00 , S and T be as in Lemma 14.3. Let A4 denote Step 4 of
k- PDS - TO - ISBM with input MR and output G0 . Then
 
µ(r − 1) > ⊗kr`×kr`
A4 · vS,F 0 ,F 00 (Kr,t )vS,F 0 ,F 00 (Kr,t ) + N (0, 1) ∼ ISBMD (kr`, r, P11 , P12 , P22 )
r
 
A4 N (0, 1)⊗kr`×kr` ∼ G(kr`, 1/2)

where P11 , P12 and P22 are as in Theorem 14.2.


Proof. Firstp
observe that, since Lemma 8.4 implies that eachp column of Kr,t contains exactly (r−1)` entries
t
equal to 1/ r (r − 1) and `p entries equal to (1 − r)/ rt (r − 1), it follows
p that vS,F ,F (Kr,t ) contains
0 00
t t
k(r − 1)` entries equal to 1/ r (r − 1) and k` entries equal to (1 − r)/ r (r − 1). Therefore there is a
subset T ⊆ [kr`] with |T | = k` such that the kr` × kr` mean matrix Z = vS,F 0 ,F 00 (Kr,t )vS,F 0 ,F 00 (Kr,t )>
has entries
2

1  (r − 1) if i, j ∈ S
Zij = t · −(r − 1) if i ∈ S and j 6∈ S or i 6∈ S and j ∈ S
r (r − 1) 
1 if i, j 6∈ S
Since the vertices of G0 are randomly permuted, it follows by definition now that if
 
µ(r − 1) > ⊗kr`×kr`
MR ∼ L · vS,F 0 ,F 00 (Kr,t )vS,F 0 ,F 00 (Kr,t ) + N (0, 1)
r
then G0 ∼ ISBMD (kr`, k`, P11 , P12 , P22 ), proving the first distributional equality in the lemma. The second
distributional equality follows from the fact that Φ(0) = 1/2.

110
We now complete the proof of Theorem 14.2 using a similar application of Lemma 6.3 as in the proof
of Theorem 10.2.

Proof of Theorem 14.2. We apply Lemma 6.3 to the steps Ai of A under each of H0 and H1 . Define the
steps of A to map inputs to outputs as follows
A A A A
1
(G, E) −−→ 2
(MPD1 , F ) −−→ (MPD2 , F 0 ) −−→
3
(MR , F 00 ) −→
4
G0

Under H1 , consider Lemma 6.3 applied to the following sequence of distributions

P0 = GE (N, k, p, q)
P1 = M[m]×[m] (S × S, Bern(p), Bern(Q)) where S ∼ Um (F )
P2 = M[krt ]×[krt ] (S × S, Bern(p), Bern(Q)) where S ∼ Ukrt (F 0 )
µ(r − 1)
P3 = · vS,F 0 ,F 00 (Kr,t )vS,F 0 ,F 00 (Kr,t )> + N (0, 1)⊗kr`×kr` where S ∼ Ukrt (F 0 )
r
P4 = ISBMD (kr`, r, P11 , P12 , P22 )

Applying Lemma 7.5, we can take


r
Q2 N 2 CQ k 2
 
1 = 4k · exp − +
48pkm 2m
n o
Q 1−Q
where CQ = max 1−Q , Q . The step A2 is exact and we can take 2 = 0. Applying Lemma 14.3 and
averaging over S ∼ Ukrt (F 0
 ) using the conditioning property of total variation in Fact 6.2 yields that we
can take 3 = O (kr`)−1 . By Lemma 14.4, Step 4 is exact and we can take 4 = 0. By Lemma 6.3, we
therefore have that
 
k 2
dTV (A (GE (N, k, p, q)) , ISBM(n, r, P11 , P12 , P22 )) = O √ + e−Ω(N /km) + (kr`)−1
N
which proves the desired result in the case of H1 . Under H0 , consider the distributions

P0 = G(N, q)
P1 = Bern(Q)⊗m×m
t ×kr t
P2 = Bern(Q)⊗kr
P3 = N (0, 1)⊗kr`×kr`
P4 = G(kr`, 1/2)

As above, Lemmas 7.5, 14.3 and 14.4 imply that we can take

Q2 N 2
 
, 2 = 0, 3 = O (kr`)−1

1 = 4k · exp − and 4 = 0
48pkm

By Lemma 6.3, we therefore have that


 2

dTV (A (G(N, q)) , G(n, 1/2)) = O e−Ω(N /kn) + (kr`)−1

which completes the proof of the theorem.

111
We now prove that a slight modification to this reduction will map to all P0 with min{P0 , 1−P0 } = Ω(1)
and to the setting where the density constraints in (12) hold exactly.

Corollary 14.5 (Reduction to Arbitrary P0 ). Let 0 < q < p ≤ 1 be constant and let N, r, k, E, ` and
n be as in Theorem 14.2 with the additional condition that kr3/2 = o(r2t ). Suppose that P0 satisfies
min{P0 , 1 − P0 } = Ω(1) and γ ∈ (0, 1) satisfies that
c
γ≤ p
rt−1 log(kr`)

for a sufficiently small constant c > 0. Then there is a poly(N ) time reduction A from graphs on N vertices
to graphs on n vertices satisfying that
  
γ γ
dTV A (GE (N, k, p, q)) , ISBMD n, r, P0 + γ, P0 − , P0 +
k−1 (k − 1)2
!
kµ3 r3/2 k 2
=O + √ + e−Ω(N /km) + (kr`)−1
r2t N
 2

dTV (A (G(N, q)) , G(n, P0 )) = O e−Ω(N /km) + (kr`)−1

Proof. Consider the reduction A that adds a simple post-processing step to k- PDS - TO - ISBM as follows. On
input graph G with N vertices:

1. Form the graph G1 by applying k- PDS - TO - ISBM to G with parameters N, r, k, E, `, n and µ where µ
is given by
rt+1
 
−1 1 1 −1
µ= ·Φ + · min{P0 , 1 − P0 } · γ
(r − 1)2 2 2
and Φ−1 is the inverse of the standard normal CDF.

2. If P0 ≤ 1/2, output the graph G2 formed by independently including each edge of G1 in G2 with
probability 2P0 . If P0 > 1/2, form G2 instead by including each edge of G1 in G2 and including
each non-edge of G1 in G2 as an edge independently with probability 2P0 − 1.

This clearly runs in poly(N ) time and it suffices to establish its approximate Markov transition properties.
Let A1 and A2 denote the two steps above with input-output pairs (G, G1 ) and (G1 , G2 ), respectively. Let
C ⊆ [n] be a fixed subset of size n/r and define

µ(r − 1)2
   
µ(r − 1)  µ 
P11 = Φ , P12 = Φ − and P 22 = Φ
rt+1 rt+1 rt+1
0 0 γ 0 γ
P11 = P0 + γ, P12 = P0 − and P22 = P0 +
r−1 (r − 1)2
We will show that
!
0 0 0
 kµ3 r3/2
dTV A2 (ISBMD (n, C, P11 , P12 , P22 )) , ISBM D n, C, P11 , P12 , P22 =O = o(1) (13)
r2t

where the upper bound is o(1) since kr3/2 = o(r2t ). First consider the case where P0 ≤ 1/2. Step 2 above
yields by construction that

A2 (ISBMD (n, C, P11 , P12 , P22 )) ∼ ISBMD (n, C, 2P0 P11 , 2P0 P12 , 2P0 P22 )

112
Suppose that X(r) ∈ {0, 1}m is sampled by first sampling X 0 ∼ Bin(m, r) and then letting X be selected
uniformly at random from all elements of {0, 1}m with support size X 0 . It follows that X(r) ∼ Bern(r)⊗m
since both distributions are permutation-invariant and their support sizes have the same distribution. Now
the data-processing inequality in Fact 6.2 implies that
dTV Bern(r)⊗m , Bern(r0 )⊗m = dTV X(r), X(r0 ) ≤ dTV Bin(m, r), Bin(m, r0 )
  

which can be upper bounded with Lemma 6.5. Using the fact that the edge indicators of ISBM conditioned
on C are independent, the tensorization property in Fact 6.2 and Lemma 6.5, we now have that
0 0 0

dTV ISBMD (n, C, 2P0 P11 , 2P0 P12 , 2P0 P22 ) , ISBMD n, C, P11 , P12 , P22
n2 (r−1) n2 (r−1)
   
⊗(n/r ) 0 ⊗(n/r ) ⊗ 0 ⊗ r2
≤ dTV Bern(2P0 P11 ) 2 , Bern(P11 ) 2 + dTV Bern(2P0 P12 ) r 2 , Bern(P12 )
 n(1−1/r)

0 ⊗(n(1−1/r)
+ dTV Bern(2P0 P22 )⊗( 2 ) , Bern(P22 ) 2 )
s s
n/r

0
2
0
n2 (r − 1)
≤ 2P0 P11 − P11 · 0 (1 − P 0 ) + 2P0 P12 − P12 ·
0 (1 − P 0 )
2P11 11 2r2 P12 12
s
n(1−1/r)

0 2

+ 2P0 P22 − P22 · 0 (1 − P 0 )
2P22 22
n  
0
0
n 0

≤ 2P0 P11 − P11 · O
+ 2P0 P12 − P12 · O √
+ 2P0 P22 − P22 · O(n)
r r
0 , P 0 and P 0 are each bounded away from 0 and 1. Observe
where the third inequality uses the fact that P11 12 22
that the definition of µ ensures
µ(r − 1)2
 
1 1
+ ·γ =Φ
2 2P0 rt+1
which implies that 2P0 P11 = P11 0 . We now use a standard Taylor approximation for the error function

Φ(x) − 1/2 around zero, given by Φ(x) = 21 + √x2π + O(x3 ) when x ∈ (−1, 1). Observe that
 
0

2P0 P12 − P12 = 2P0 · Φ − µ(r − 1) 1 γ
t+1
− +
r 2 2P0 (r − 1)
µ(r − 1)2
     
µ(r − 1) 1 1 1
= 2P0 · Φ − t+1
− + Φ −
r 2 r−1 rt+1 2
 3 2
µ r
=O
r3t
0 | = O µ3 /r 3t−1 . Combining all of these bounds

An analogous computation shows that |2P0 P22 − P22
now yields Equation (13) after noting that n = kr` = O(krt ) implies that nµ3 r3/2 /r3t = O(kr3/2 /r2t ). A
nearly identical argument considering the complement of the graph G1 and replacing with P0 with 1 − P0
establishes Equation (13) in the case when P0 > 1/2. Now observe that
A2 (G(n, 1/2)) ∼ G(n, P0 )
by definition. Now consider applying Lemma 6.3 to the steps A1 and A2 using an analogous recipe as in
the proof of Theorem 14.2. We have that 1 is bounded by Theorem 14.2 and 2 is bounded by the argument
above. Note that in order to apply Theorem 14.2 here, it must follow that the required bound on µ is met.
Observe that
µ(r − 1)2
   
1  µ 
γ = 2P0 Φ − = Θ
rt+1 2 rt−1

113
and hence if γ satisfies the upper bound in the statement of the corollary for a sufficiently small constant
c, then µ satisfies the requirement in Theorem 14.2 since p and q are constant. This application of Lemma
6.3 now yields the desired two approximate Markov transition properties and completes the proof of the
corollary.

We now show that setting parameters in the reduction of Corollary 14.5 as in the recipe set out in
Theorems 3.1 and 13.4 now shows that we can fill out the parameter space for ISBM obeying the edge
density constraints of (12) below the Kesten-Stigum threshold. This proves the following computational
lower bound for ISBM. We remark that typically the parameter regime of interest for the k-block stochastic
block model is when k = no(1) , and thus the conditions ( T ) and k = o(n1/3 ) are only mild restrictions here.
Note that the condition ( T ) here is the same condition that was introduced in Section 13.1.
Theorem 3.2 (Lower Bounds for ISBM). Suppose that (n, k) satisfy condition ( T ), that k is prime or k =
ωn (1) and k = o(n1/3 ), and suppose that P0 ∈ (0, 1) satisfies min{P0 , 1 − P0 } = Ωn (1). Consider the
testing problem ISBM(n, k, P11 , P12 , P22 ) where
γ γ
P11 = P0 + γ, P12 = P0 − and P22 = P0 +
k−1 (k − 1)2
Then the k- PC conjecture or k- PDS conjecture for constant 0 < q < p ≤ 1 both imply that there is a
computational lower bound for ISBM(n, k, P11 , P12 , P22 ) at all levels of signal below the Kesten-Stigum
threshold of γ 2 = õ(k 2 /n).
Proof. It suffices to show that the reduction A in Corollary 14.5 applied with r ≥ 2 fills out all of the
possible growth rates specified by the computational lower bound γ 2 = õ(k 2 /n) and the other conditions in
the theorem statement. Fix a constant pair of probabilities 0 < q < p ≤ 1 and any sequence of parameters
(n, k, γ, P0 ) all of which are implicitly functions of n such that (n, k) satisfies ( T ) and
k2
γ2 ≤ , 2(w0 )2 k ≤ n1/3 and min{P0 , 1 − P0 } = Ωn (1)
w0 · n log n
for sufficiently large n and w0 = w0 (n) = (log n)c for a sufficiently large constant c > 0. Now let w =
w(n) → ∞ be an arbitrarily slow-growing increasing positive integer-valued function at least satisfying that
w(n) = no(1) . As in the proof of Theorem 3.1, we now specify the following in order to fulfill the criteria
in Condition 6.1:
1. a sequence (N, kN ) such that the k- PDS(N, kN , p, q) is hard according to Conjecture 2.3; and
2. a sequence (n0 , k 0 , γ, P0 ) with a subsequence that satisfies three conditions: (2.1) the parameters on
the subsequence are in the regime of the desired computational lower bound for ISBM; (2.2) they
have the same growth rate as (n, k, γ, P0 ) on this subsequence; and (2.3) such that ISBM with the
parameters on this subsequence can be produced by A with input k- PDS(N, kN , p, q).
As discussed in Section 6.2, this is sufficient to prove the theorem. We choose these parameters as follows:
• let k 0 = r be the smallest prime satisfying that k ≤ r ≤ 2k, which exists by Bertrand’s postulate and
can be found in poly(n) time;

• let t be such that rt is the closest power of r to n and let
$  %
p −1 −2  t √

1
kN = 1+ w · min r , n
2 Q
p √ 
where Q = 1 − (1 − p)(1 − q) + 1{p=1} q − 1 ; and

114
rt −1
• let n0 = kN r` where ` = r−1
2 .
and let N = wkN

Note that we have that w2 r ≤ n1/3 since r ≤ 2k. Now observe that we have the following bounds
 t 
rt
 
0 t −2 r
n  kN r  w · min √ , 1 · √ n
n n
 √ √  n 
kN r3/2 . w−2 · min rt , n · w−3 n . w−4 · 2t r2t
   √  r
p n
m≤2 + 1 wkN 2
≤ w−3 · t kN rt
Q r
kN r` ≤ poly(N )
k2 1 r2t log(kN r`)
γ2 ≤ = 0 2t−2 ·
w0
· n log n w ·r log(kN r`) n log n
2 log n0
 t  t r2 rt
 
2 r −2 r r
γ . 0 0 w · min √ , 1 · √ · . · √
w · n log n0 n n log n w0 · w2 · n0 log n0 n

p
 √
where m is the smallest multiple of kN larger Q + 1 N . Now observe that as long as n = Θ̃(rt ) then:
(2.1) the last inequality above on γ 2 would imply that (n0 , k 0 , γ, P0 ) is in the desired hard regime; (2.2) n
and n0 have the same growth rate since w = no(1) , and k and k 0 = r have the same growth rate since either
k 0 = k or k 0 = Θ(k) = ω(1); and (2.3) the middle four bounds above imply that taking c large enough
yields the conditions needed to apply Corollary 14.5 to yield the desired reduction. By Lemma 13.2, there

is an infinite subsequence of the input parameters such that n = Θ̃(rt ), which concludes the proof as in
Theorem 3.1.

14.2 Testing Hidden Partition Models


In this section, we establish statistical-computational gaps based on the k- PC and k- PDS conjectures for
detection in the Gaussian and bipartite hidden partition models introduced in Sections 3.3 and 6.3. These
two models are bipartite analogues of the subgraph variants of the k-block stochastic block model in the
constant edge density regime. Specifically, they are multiple-community variants of the subgraph stochastic
block model considered in [BBH18].
The motivation for considering these two models is to illustrate the versatility of Bernoulli rotations as
a reduction primitive. These two models are structurally very different from planted clique yet can be pro-
duced through Bernoulli rotations for appropriate choices of the output mean vectors A1 , A2 , . . . , Am . The
mean vectors specified in the reduction are vectorizations of the slices of the design tensor Tr,t constructed
based on the incidence geometry of Ftr . The definition of Tr,t and several of its properties can be found in
Section 8.3. The reduction in this section demonstrates that natural applications of Bernoulli rotations can
require more involved constructions than Kr,t in order to produce tight computational lower bounds.
We begin by reviewing the definitions of the two main models considered in this section – Gaussian and
bipartite hidden partition models – which were introduced in Sections 3.3 and 6.3.
Definition 14.6 (Gaussian Hidden Partition Models). Let n, r and K be positive integers, let γ ∈ R and
let C = (C1 , C2 , . . . , Cr ) be a sequence of disjoint K-subsets of [n]. Let D = (D1 , D2 , . . . , Dr ) be
another such sequence. The distribution GHPMD (n, r, C, D, γ) over matrices M ∈ Rn×n is such that
Mij ∼i.i.d. N (µij , 1) where

 γ if i ∈ Ch and j ∈ Dh for some h ∈ [r]
γ
µij = − r−1 if i ∈ Ch1 and j ∈ Dh2 where h1 6= h2

0 otherwise

115
for each i, j ∈ [n]. Furthermore, let GHPMD (n, r, K, γ) denote the mixture over GHPMD (n, r, C, D, γ)
induced by choosing C and D independently and uniformly at random.

Definition 14.7 (Bipartite Hidden Partition Models). Let n, r, K, C and D be as in Definition 14.6 and let
P0 , γ ∈ (0, 1) be such that γ/r ≤ P0 ≤ 1 − γ. The distribution BHPMD (n, r, C, D, P0 , γ) over bipar-
tite graphs G with two parts of size n, each indexed by [n], such that each edge (i, j) is included in G
independently with the following probabilities

 P0 + γ if i ∈ Ch and j ∈ Dh for some h ∈ [r]
γ
P [(i, j) ∈ E(G)] = P0 − r−1 if i ∈ Ch1 and j ∈ Dh2 where h1 6= h2

P0 otherwise

for each i, j ∈ [n]. Let BHPMD (n, r, K, P0 , γ) denote the mixture over BHPMD (n, r, C, D, P0 , γ) induced
by choosing C and D independently and uniformly at random.

The problems we consider in this section are the two simple hypothesis testing problems GHPM and
BHPM from Section 6.3, given by

H0 : M ∼ N (0, 1)⊗n×n and H1 : M ∼ GHPM(n, r, K, γ)


H0 : G ∼ GB (n, n, P0 ) and H1 : G ∼ BHPM(n, r, K, P0 , γ)

An important remark is that the hypothesis testing formulations above for these two problems seem to have
different computational and statistical barriers from the tasks of recovering C and D. We now state the
following lemma, giving guarantees for a natural polynomial-time test and exponential time test for GHPM.
The proof of this lemma is tangential to the main focus of this section – computational lower bounds for
GHPM and BHPM – and is deferred to Appendix B.2.

Lemma 14.8 (Tests for GHPM). Given a matrix M ∈ Rn×n , let sC (M ) = ni,j=1 Mij2 − n2 and
P

 
Xr X X 
sI (M ) = max Mij
C,D  
h=1 i∈Ch j∈Dh

where the maximum is over all pairs (C, D) of sequences of disjoint K-subsets of [n]. Let w = w(n) be any
increasing function with w(n) → ∞ as n → ∞. We prove the following:

1. If M ∼ GHPMD (n, r, K, γ), then with probability 1 − on (1) it holds that

rK 2 √
 
2 2 2 Kγ
sC (M ) ≥ rK γ + · γ − w n + γK r + and sI (M ) ≥ rK 2 γ − wr1/2 K
r−1 r

2. If M ∼ N (0, 1)⊗n×n , then with probability 1 − on (1) it holds that


p
sC (M ) ≤ wn and sI (M ) ≤ 2rK 3/2 w (log n + log r)

This lemma implies upper bounds on the computational and statistical barriers for GHPM. Specifically, it
implies that the variance test sC succeeds above γcomp2 = Θ̃(n/rK 2 ) and the search test sI succeeds above
2 = Θ̃(1/K). Thus, showing that there is a computational barrier at this level of signal γ
γIT comp is sufficient to
show that there is a nontrivial statistical-computational gap for GHPM. For P0 with min{P0 , 1−P0 } = Ω(1),
analogous tests show the same upper bounds on γcomp and γIT for BHPM.

116
Algorithm k- PDS - TO - GHPM
Inputs: k- PDS instance G ∈ GN with dense subgraph size k that divides N , and the following parameters

• partition E, edge probabilities 0 < q < p ≤ 1, Q ∈ (0, 1) and m as in Figure 12


rt −1
• refinement parameter s and number of vertices n = ksrt where r is a prime number, ` = r−1 for
some t ∈ N satisfy that m ≤ ks(r − 1)` ≤ poly(N )

• mean parameter µ ∈ (0, 1) as in Figure 12

1. Symmetrize and Plant Diagonals: Compute MPD1 ∈ {0, 1}m×m and F as in Step 1 of Figure 12.

2. Pad and Further Partition: Form MPD2 and F 0 as in Step 2 of Figure 12 modified so that MPD2
is a ks(r − 1)` × ks(r − 1)` matrix and each Fi0 has size s(r − 1)`. Let F s be the partition of
[ks(r − 1)`] into ks parts of size (r − 1)` by refining F 0 by splitting each of its parts into s parts
of equal size arbitrarily.

3. Bernoulli Rotations: Let F o be a partition of [ksrt ] into ks equally sized parts. Now compute the
t t
matrix MR ∈ Rksr ×ksr as follows:

(1) For each i, j ∈ [ks], flatten the (r − 1)` × (r − 1)` submatrix (MP )Fis ,Fjs into a vector
2 `2 > ∈ Rr 2t ×(r−1)2 `2
Vij ∈ R(r−1) and let A = Mr,t as in Definition 8.9.
(2) Apply B ERN -ROTATIONS to Vij with matrix A, rejection kernel parameter
p RRK = ksrt ,
2t
Bernoulli probabilities 0 < Q < p ≤ 1, output dimension r , λ = 1 + (r − 1)−1 and
mean parameter µ.
(3) Set the entries of (MR )Fio ,Fjo to be the entries of the output in (2) unflattened into a matrix.

4. Permute and Output: Output the matrix MR with its rows and columns independently permuted
uniformly at random.

Figure 13: Reduction from k-partite planted dense subgraph to gaussian hidden partition models.

Consider the case when n = rK, which corresponds to a testing variant of the bipartite k-block stochas-
tic block model. In this case, the upper bounds shown by the previous lemma coincide at γcomp 2 2 =
, γIT
O(r/n) and hence do not support the existence of a statistical-computational gap. The subgraph formula-
tion in which rK  n seems crucial to yielding a testing problem with a statistical-computational gap. We
also remark that while this testing formulation when n = rK may not have a gap, the task of recovering C
and D likely shares the gap conjectured in the k-block stochastic block model. Specifically, the conjectured
computational barrier at the Kesten-Stigum threshold lies at γ 2 = Θ̃(r2 /n), which lies well above the r/n
limit in the testing formulation.
The rest of this section is devoted to giving our main reduction k- PDS - TO - GHPM showing a compu-
tational barrier at γ 2 = õ(n/rK 2 ). This reduction is shown in Figure 13 and its approximate Markov
transition guarantees are stated in the theorem below. The intuition behind why our reduction is tight to
the algorithm sC is as follows. Bernoulli rotations are approximately `2 -norm preserving in the signal to
noise ratio if the output dimension is comparable to the input dimension with m  n. Much of the effort
in constructing Tr,t and Mr,t in Section 8.3 was devoted to the linear functions L which are crucial in de-

117
signing Mr,t to be nearly square and hence achieve m  n in Bernoulli rotations. Any reduction that is
approximately `2 -norm preserving in the signal to noise ratio will be tight to a variance test such as sC .
The key to the reduction k- PDS - TO - GHPM lies in the construction of Tr,t and Mr,t in Section 8.3. The
rest of the proof of the following theorem is similar to the proofs in the previous section. We omit details that
are similar for brevity. We recall from Section 6.4 that, given a matrix M ∈ Rn×n , the matrix MS,T ∈ Rk×k
where S, T are k-subsets of [n] refers to the minor of M restricted to the row indices in S and column
indices in T . Furthermore, (MS,T )i,j = MσS (i),σT (j) where σS : [k] → S is the unique order-preserving
bijection and σT is analogously defined.

Theorem 14.9 (Reduction to GHPM). Let N be a parameter and r = r(N ) ≥ 2 be a prime number. Fix
initial and target parameters as follows:

• Initial k- BPDS Parameters: k, N, p, q and E as in Theorem 14.2.


rt −1
• Target GHPM Parameters: (n, r, K, γ) where n = ksrt , K = krt−1 and ` = r−1 for some parame-
ters t = t(N ), s = s(N ) ∈ N satisfying that that

m ≤ ks(r − 1)` ≤ poly(N )


µ(r−1)
where m and Q are as in Theorem 14.9. The target level of signal γ is given by γ = √
rt r
where
    
1 p 1−Q
µ≤ p · min log , log
2 6 log(ksrt ) + 2 log(p − Q)−1 Q 1−p

Let A(G) denote k- PDS - TO - GHPM applied to the graph G with these parameters. Then A runs in poly(N )
time and it follows that
 
k −Ω(N 2 /km) t −1
dTV (A (GE (N, k, p, q)) , GHPMD (n, r, K, γ)) = O √ +e + (ksr )
N
 2

dTV A (G(N, q)) , N (0, 1)⊗n×n = O e−Ω(N /km) + (ksrt )−1


In order to state the approximate Markov transition guarantees of the Bernoulli rotations step of k-
PDS - TO - GHPM , we need the formalism from Section 8.3 to describe the matrix Mr,t , tensor Tr,t and their
community alignment properties. While this will require a plethora of cumbersome notation, the goal of
the ensuing discussion is simple – we will show that Lemma 8.11 guarantees that stitching together the
individual applications of B ERN -ROTATIONS in Step 3 of k- PDS - TO - GHPM yields a valid instance of GHPM.
t t
Recall C(M 1,1 , M 1,2 , . . . , M ks,ks ) denotes the concatenation of k 2 s2 matrices M i,j ∈ Rr ×r into a
ksrt × ksrt matrix, as introduced in Section 8.3. Given a partition F of [ksrt ] into ks equally sized parts,
let CF (M 1,1 , M 1,2 , . . . , M ks,ks ) denote the concatenation of the M i,j , where now the entries of M i,j appear
in CF on the index set Fi ×Fj . For consistency, we fix a canonical embedding of the row and column indices
t t
of Rr ×r to Fi × Fj by always preserving the order of indices.
Let F o and F s be fixed partitions of [ksrt ] and [ks(r − 1)`] into k parts of size rt and (r − 1)`,
respectively, and let S ⊆ [ks(r − 1)`] be such that |S| = k and S intersects each part of F s in at most one
t t
element. Now let MS,F s ,F o (Tr,t ) ∈ Rksr ×ksr be the matrix
(
(Vt ,Vt ,Lij )
Tr,t i j if S ∩ Fis 6= ∅
 
1,1 1,2 ks,ks i,j
MS,F s ,F o (Tr,t ) = CF o M , M , . . . , M where M =
0 otherwise

where ti , tj and Lij are given by:

118
• let σ : [ks(r − 1)`] → [ks(r − 1)`] be the unique bijection transforming the partition F s to the
canonical contiguous partition {1, . . . , (r − 1)`} ∪ · · · ∪ {(ks − 1)(r − 1)` + 1, . . . , ks(r − 1)`} while
preserving ordering on each part Fis for 1 ≤ i ≤ ks;

• let s0i be the unique element in σ(S ∩ Fis ) for each i for which this intersection is nonempty, and let
si be the unique positive integer with 1 ≤ si ≤ (r − 1)` and si ≡ s0i (mod (r − 1)`); and

• ti , tj and Lij are as in Lemma 8.11 given these si i.e. ti and tj are the unique 1 ≤ ti , tj ≤ ` such
that ti ≡ si (mod `) and tj ≡ sj (mod `) and Lij : Fr → Fr is given by Lij (x) = ai x + aj where
ai = dsi /`e and aj = dsj /`e.
The next lemma makes explicit the implications of Lemma 8.1 and Lemma 8.11 for the approximate Markov
transition guarantees of Step 3 in k- PDS - TO - GHPM. The proof follows a similar structure to the proof of
Lemma 14.3 and we omit identical details.
Lemma 14.10 (Bernoulli Rotations for GHPM). Let F o and F s be a fixed partitions of [ksrt ] and [ks(r−1)`]
into k parts of size rt and (r − 1)`, respectively, and let S ⊆ [ksrt ] be such that |S| = k and |S ∩ Fis | ≤ 1
for each 1 ≤ i ≤ ks. Let A3 denote Step 3 of k- PDS - TO - GHPM with input MPD2 and output MR . Suppose
that p, Q and µ are as in Theorem 14.2, then it follows that
 
dTV A3 M[ks(r−1)`]×[ks(r−1)`] (S × S, Bern(p), Bern(Q)) ,
r !!
r−1 t t
· MS,F s ,F o (Tr,t ) + N (0, 1)⊗ksr ×ksr = O (ksrt )−1

L µ
r
   t t

dTV A3 Bern(Q)⊗ks(r−1)`×ks(r−1)` , N (0, 1)⊗ksr ×ksr = O (ksrt )−1


and furthermore, for all such subsets S, it holds that the matrix MS,F s ,F o (Tr,t ) has zero entries other than
in a krt × krt submatrix, which is also r-block as defined in Section 8.3.
Proof. Define s0i , si , ti and Lij as in the preceding discussion for all i, j with S ∩ Fis and S ∩ Fjs nonempty.
Let (1) and (2) denote the following two cases:
1. MPD2 ∼ M[ks(r−1)`]×[ks(r−1)`] (S × S, Bern(p), Bern(Q)); and

2. MPD2 ∼ Bern(Q)⊗ks(r−1)`×ks(r−1)` .
Now define the matrix MR0 with independent entries such that
(Vti ,Vtj ,Lij )
( q t t
µ r−1r · Tr,t + N (0, 1)⊗r ×r if (1) holds, S ∩ Fis 6= ∅ and S ∩ Fjs 6= ∅
MR0 F s ,F s ∼

t t
i j
N (0, 1)⊗r ×r otherwise if (1) holds or if (2) holds

for each 1 ≤ i, j ≤ ks. The vectorization and ordering conventions we adopt imply that if S ∩ Fis 6= ∅ and
S ∩ Fjs 6= ∅, then the unflattening of the row with index (si − 1)(r − 1)` + sj in Mr,t is the approximate
output mean of A3 on the minor (MR )Fis ,Fjs when applying Lemma 8.1 under (1). By Definition 8.9 and the
definitions of ai , ti and Lij , this unflattened row is exactly the matrix
(Vti ,Vtj ,Lij )
M i,j = Tr,t

Combining this observation with Lemmas 8.1 and 8.10 yields that under both (1) and (2), we have that
 
dTV (MR )F s ,F s , MR0 F s ,F s = O r2t · (ksrt )−3
 
i j i j

119
for all 1 ≤ i, j ≤ ks. Through the same argument as in Lemma 14.3, thetensorization property of total
variation in Fact 6.2 now yields that dTV (L(MR ), L(MR0 )) = O (ksrt )−1 under both (1) and (2). Now
note that the definition of CF o implies that
( q t t
0 µ r−1r · MS,F s ,F o (Tr,t ) + N (0, 1)⊗ksr ×ksr if (1) holds
MR ∼ t t
N (0, 1)⊗ksr ×ksr if (2) holds

which completes the proof of the approximate Markov transition guarantees in the lemma statement. Now
note that MS,F s ,F o (Tr,t ) is zero everywhere other than on the union U of the Fio over the i such that
S ∩ Fis 6= ∅. There are exactly k such i and thus |U | = krt . Note that r-block matrices remain r-
block matrices under permutations of column and row indices, and therefore Lemma 8.11 implies the same
conclusion if C is replaced by CF o . Applying Lemma 8.11 to the submatrix of MS,F s ,F o (Tr,t ) restricted to
the indices of U now completes the proof of the lemma.

We now complete the proof of Theorem 14.9, again applying Lemma 6.3 as in the proofs of Theorems
10.2 and 14.2. In this theorem, we let Unk (F ) denote the uniform distribution over subsets S ⊆ [n] of size k
intersecting each part of the partition F in at most one element. When F has exactly k parts, this definition
recovers the previously defined distribution Un (F ).

Proof of Theorem 14.9. Let the steps of A to map inputs to outputs as follows
A A A A
1
(G, E) −−→ 2
(MPD1 , F ) −−→ (MPD2 , F s ) −−→
3
(MR , F o ) −→
4
MR0

where here MR0 denotes the permuted form of MR after Step 4. Under H1 , consider Lemma 6.3 applied to
the following sequence of distributions

P0 = GE (N, k, p, q)
P1 = M[m]×[m] (S × S, Bern(p), Bern(Q)) where S ∼ Um (F )
k
P2 = M[ks(r−1)`]×[ks(r−1)`] (S × S, Bern(p), Bern(Q)) where S ∼ Uks(r−1)` (F s )
r
r−1 t t
P3 = µ · MS,F s ,F o (Tr,t ) + N (0, 1)⊗ksr ×ksr where S ∼ Uks(r−1)`
k
(F s )
r
 
t t−1 µ(r − 1)
P4 = GHPMD ksr , r, kr , t √
r r
n o
Q
Let CQ = max 1−Q , 1−Q
Q and consider setting

 r
Q2 N 2 CQ k 2

3 = O (ksrt )−1

1 = 4k · exp − + , 2 = 0, and 4 = 0
48pkm 2m

As in the proof of Theorem 14.2, Lemma 7.5 implies this is a valid choice of 1 and A2 is exact so we can
k
take 2 = 0. The choice of 3 is valid by applying Lemma 14.10 and averaging over S ∼ Uks(r−1)` (F s )
t t
that the kr × kr r-block submatrix
using the conditioning property of total variation in Fact 6.2. Now noteq
r−1
of MS,F s ,F o (Tr,t ) has entries √
rt r−1
and − rt √1r−1 . Thus the matrix µ r−1
r · MS,F s ,F o (Tr,t ) is of the form
of the mean matrix (µij )1≤i,j≤ksrt in Definition 14.6 for some choice of C and D where K = krt−1 and
r
r−1 r−1 µ(r − 1)
γ=µ · t√ = √
r r r−1 rt r

120
This implies that permuting the rows and columns of P3 yields P4 exactly with 4 = 0. Applying Lemma
6.3 now yields the first bound in the theorem statement. Under H0 , consider the distributions
t t
P0 = G(N, q), P1 = Bern(Q)⊗m×m , P2 = Bern(Q)⊗ks(r−1)`×ks(r−1)` , P3 = P4 = N (0, 1)⊗ksr ×ksr
 
Q2 N 2
As above, Lemmas 7.5 and 14.10 imply that we can take 1 = 4k · exp − 48pkm and 2 , 3 and 4 as
above. Lemma 6.3 now yields the second bound in the theorem statement.

We now append a final post-processing step to the reduction k- PDS - TO - GHPM to map to BHPM. The
proof of the following corollary is similar to that of Corollary 14.5 and is deferred to Appendix B.2.
Corollary 14.11 (Reduction from GHPM to BHPM). Let 0 < q < p ≤ 1 be constant and let the parameters

k, N, E, r, `, n, s and K be as in Theorem 14.9 with the additional condition that k r = o(r2t ). Let
γ ∈ (0, 1) be such that
c(r − 1)
γ≤ p
r r log(ksrt )
t

for a sufficiently small constant c > 0. Suppose that P0 satisfies min{P0 , 1 − P0 } = Ω(1). Then there is a
poly(N ) time reduction A from graphs on N vertices to graphs on n vertices satisfying that
 3√ 
kµ r k −Ω(N 2 /km) t −1
dTV (A (GE (N, k, p, q)) , BHPMD (n, r, K, P0 , γ)) = O + √ +e + (ksr )
r2t N
 2

dTV (A (G(N, q)) , GB (N, N, P0 )) = O e−Ω(N /km) + (ksrt )−1

Collecting the results of this section, we arrive at the following computational lower bounds for GHPM
and BHPM matching the efficient test sC in Lemma 14.8.
Theorem 3.3 (Lower Bounds for GHPM and BHPM). Suppose that r2 K 2 = ω̃(n) and (dr2 K 2 /ne, r) satis-
fies condition ( T ), suppose r is prime or r = ωn (1) and suppose that P0 ∈ (0, 1) satisfies min{P0 , 1−P0 } =
Ωn (1). Then the k- PC conjecture or k- PDS conjecture for constant 0 < q < p ≤ 1 both imply that there
is a computational lower bound for each of GHPM(n, r, K, γ) for all levels of signal γ 2 = õ(n/rK 2 ). This
same lower bound also holds for BHPM(n, r, K, P0 , γ) given the additional condition n = o(rK 4/3 ).
Proof. The proof of this theorem will follow that of Theorem 3.2 with several modifications. We begin by
showing a lower bound for GHPM. It suffices to show that the reduction k- PDS - TO - GHPM fills out all of the
possible growth rates specified by the computational lower bound γ 2 = õ(n/rK 2 ) and the other conditions
in the theorem statement. Fix a constant pair of probabilities 0 < q < p ≤ 1 and any sequence of parameters
(n, r, K, γ) all of which are implicitly functions of n such that (dr2 K 2 /ne, r) satisfies ( T ) and
n
γ2 ≤ c · 0 and r2 K 2 ≥ w0 n
w · rK 2 log n
for sufficiently large n and w0 = w0 (n) = (log n)c for a sufficiently large constant c > 0. Now let w =
w(n) → ∞ be an arbitrarily slow-growing increasing positive integer-valued function at least satisfying
that w(n) = no(1) . As in Theorem 3.2, we now specify the following parameters which are sufficient to
establish the lower bound for GHPM:
1. a sequence (N, kN ) such that k- PDS(N, kN , p, q) is hard according to Conjecture 2.3; and
2. a sequence (n0 , r0 , K 0 , γ, s, t, µ) with a subsequence that satisfies three conditions: (2.1) the param-
eters on the subsequence are in the regime of the desired computational lower bound for GHPM;
(2.2) the parameters (n0 , r0 , K 0 , γ) have the same growth rate as (n, r, K, γ) on this subsequence;
and (2.3) such that GHPM(n0 , r0 , K 0 , γ) with the parameters on this subsequence can be produced by
k- PDS - TO - GHPM with input k- PDS(N, kN , p, q) applied with additional parameters s, t and µ.

121
We choose these parameters as follows:

• let r0 = r be the smallest prime satisfying that r ≤ r0 ≤ 2r, which exists by Bertrand’s postulate and
can be found in poly(n) time;

√ γ(r0 )t r0
• let t be such that (r0 )t is the closest power of r0 to r0 K/ n, let s = dn/r0 Ke and let µ = r −1 ;
0

• now let kN be given by


$  %
p −1 −2 √
 
1 K
kN = 1+ w · min , n
2 Q (r0 )t−1
p √ 
where Q = 1 − (1 − p)(1 − q) + 1{p=1} q − 1 ; and

• let K 0 = kN (r0 )t−1 , let n0 = kN s(r0 )t and let N = wkN


2 .

Now observe that we have the following bounds


√ 
(r0 )t−1 n
 
0 0 t −2
n  kN s(r )  w · min 1, n
K
√ 
n0 (r0 )t−1 n
 
0 0 t−1 −2
K  kN (r ) = 0  w · min 1, K
rs K
     
p −1 K K
m≤2 2
+ 1 wkN ≤ w · min √ , 1 · 0 t−1 √ kN s(r0 − 1)`
Q (r0 )t−1 n (r ) n
kN s(r0 − 1)` ≤ poly(N )
 0 0 2
0 2 0 2 rK n
(r ) (K ) ≥ · 0 · w 0 n0
rK n
√ r √
γ(r0 )t r0 r0 (r0 )t−1 n 2
µ= ≤ · · 0 1/2 √
r0 − 1 r K (w ) log n
n 0 0 0
r n (K ) log n 0 2 0
γ2 . 0 0 0 2 0
· · · ·
w · r (K ) log n r n K2 log n
0 t −1
 
where m is the smallest multiple of kN larger Q p
+ 1 N and ` = (rr0)−1 . Now observe that as long as

r K/ n = Θ̃((r ) ) then: (2.1) the last inequality above on γ would imply that (n0 , r0 , K 0 , γ) is in the
0 0 t 2

desired hard regime; (2.2) the pairs of parameters (n, n0 ), (K, K 0 ) and (r, r0 ) have the same growth rates
since w = no(1) and either r0 = r or r0 = Θ(r) = ω(1); and (2.3) the third through sixth bounds above imply
that taking c large enough yields the conditions needed to apply Corollary 14.5 to yield the desired reduction.

By Lemma 13.2, there is an infinite subsequence of the input parameters such that r0 K/ n = Θ̃((r0 )t ),
which concludes the proof of the lower bound for GHPM as in Theorems 3.1 and 3.2.
The computational lower bound for BHPM follows from the same argument applied to A from Corollary
14.11 with the following modification. The conditions in the theorem statement√for BHPM add the initial
condition that rK 4/3 ≥ w0 n. The parameter settings above then imply that kN r0 = õ((r0 )2t ) holds on

the parameter subsequence with r0 K/ n = Θ̃((r0 )t ). The same reasoning above then yields the desired
computational lower bound for BHPM and completes the proof of the theorem.

122
14.3 Semirandom Single Community Recovery
In this section, we show that the k- PC and k- PDS conjectures with constant edge density imply the PDS
Recovery Conjecture under a semirandom adversary in the regime of constant ambient edge density. The
PDS Recovery Conjecture and formulations of semirandom single community recovery here are as they were
introduced in Sections 3.4 and 6.3. Our reduction from k- PDS to SEMI - CR is shown in Figure 14. On a high
level, our main observation is that an adversary in SEMI - CR with subgraph size k can simulate the problem
of detecting for the presence of a hidden ISBM instance on a subgraph with O(k) in an n-vertex Erdős-
Rényi graph. Furthermore, combining the Bernoulli rotations step with K3,t as in k- PDS - TO - ISBM with the
partition refinement of k- PDS - TO - GHPM can be shown to map to this detection problem. Furthermore, it
faithfully recovers the Kesten-Stigum bound from the PDS Recovery Conjecture as opposed to the slower
detection rate. The key proofs in this section resemble similar proofs in the previous two sections. We omit
details that are similar for brevity.
Before proceeding with the main proofs of this section, we discuss the relationship between our results
and the reduction of [CLR15]. In [CLR15], the authors prove a detection-recovery gap in the context of sub-
Gaussian submatrix localization based on the hardness of finding a planted k-clique in a random n/2-regular
graph. This degree-regular formulation of PC was previously considered in [DM15a] and differs in a number
of ways from PC. For example, it is unclear how to generate a sample from the degree-regular variant in
polynomial time. We remark that the reduction of [CLR15], when instead applied the usual formulation of
PC produces a matrix with highly dependent entries. Specifically, the sum of the entries of the output matrix
has variance n2 /µ where µ  1 is the mean parameter for the submatrix localization instance whereas an
output matrix with independent entries of unit variance would have a sum of entries of variance n2 . Note
that, in general, any reduction beginning with PC that also preserves the natural H0 hypothesis cannot show
the existence of a detection-recovery gap, as any lower bounds for localization would also apply to detection.
Formally, the goal of this section is to show that the reduction k PDS - TO - SEMI - CR in Figure 14 maps
from k- PC and k- PDS to the following distribution under H1 , for a particular choice of µ1 , µ2 and µ3 just
below the PDS Recovery Conjecture. We remark that k- PDS - TO - SEMI - CR maps to the specific case where
P0 = 1/2. This reduction is extended in Corollary 14.15 to handle P0 6= 1/2 with min{P0 , 1−P0 } = Ω(1).

Definition 14.12 (Target SEMI - CR Instance). Given positive integers k, k 0 ≤ n and P0 , µ1 , µ2 , µ3 ∈ (0, 1)
satisfying that µ1 , µ2 ≤ P0 ≤ 1 − µ3 , let TSI(n, k, k 0 , P0 , µ1 , µ2 , µ3 ) be the distribution over G ∈ Gn
sampled as follows:

1. choose two disjoint subsets S ⊆ [n] and S 0 ⊆ [n] of sizes |S| = k and |S 0 | = k 0 , respectively,
uniformly at random; and

2. include the edge {i, j} in E(G) independently with probability pij where

if (i, j) ∈ S 02

 P0
if (i, j) ∈ [n]2 \(S ∪ S 0 )2

P0 − µ1

pij =
 P 0 − µ2
 if (i, j) ∈ S × S 0 or (i, j) ∈ S 0 × S
if (i, j) ∈ S 2

P 0 + µ3

Note that this distribution can be produced by a semirandom adversary in SEMI - CR(n, k, P0 + µ3 , P0 )
under H1 as follows:

1. samples S 0 of size k 0 uniformly at random from all k 0 -subsets of [n]\S where S is the vertex set of
the planted dense subgraph; and

123
2. if the edge {i, j} is in E(G), remove it from G independently with probability qij where

if (i, j) ∈ S 2 ∪ S 02

 0
qij = µ /P if (i, j) 6∈ (S ∪ S 0 )2
 1 0
µ2 /P0 if (i, j) ∈ S × S 0 or (i, j) ∈ S 0 × S

Note that G(n, P00 ) can be produced by the adversary under H0 of SEMI - CR(n, k, P0 + µ1 , P0 ) as long as
P00 ≤ P0 by removing all edges independently with probability 1 − P00 /P0 . Thus it suffices to map to a
testing problem between some TSI(n, k, k 0 , P0 , µ1 , µ2 , µ3 ) and G(n, P00 ).
The next theorem establishes our main Markov transition guarantees for the reduction k PDS - TO - SEMI - CR,
which map to such a testing problem when P0 = 1/2.

Theorem 14.13 (Reduction to SEMI - CR). Let N be a parameter and fix other parameters as follows:

• Initial k- BPDS Parameters: k, N, p, q and E as in Theorem 14.2.


t
• Target SEMI - CR Parameters: (n, K, 1/2 + γ, 1/2) where n = 3ks · 3 2−1 and K = (3t − 1)k for some
parameters t = t(N ), s = s(N ) ∈ N satisfying that

m ≤ 3t ks ≤ n ≤ poly(N )
µ
where m and Q are as in Theorem 14.9. The target level of signal γ is given by γ = Φ 3t − 1/2
and the target TSI densities are
 µ  1 µ 1
µ1 = Φ t+1 − and µ2 = µ3 = Φ −
3 2 3t 2
where µ ∈ (0, 1) satisfies that
    
1 p 1−Q
µ≤ p · min log , log
2 6 log n + 2 log(p − Q)−1 Q 1−p

Let A(G) denote k- PDS - TO - SEMI - CR applied to the graph G with these parameters. Then A runs in
poly(N ) time and it follows that
 
k −Ω(N 2 /km) t −1
dTV (A (GE (N, k, p, q)) , TSI(n, K, K/2, 1/2, µ1 , µ2 , µ3 )) = O √ + e + (3 ks)
N
 2

dTV (A (G(N, q)) , G (n, 1/2 − µ1 )) = O e−Ω(N /km) + (3t ks)−1

To prove this theorem, we prove a lemma analyzing the Bernoulli rotations step in Figure 14. The proof
of this lemma is similar to those of Lemmas 14.3 and 14.10. We omit details that are identical. Recall from
Section 10.1 the definition of the vector vS,F s ,F o (M ) ∈ Rab where F s and F o are partitions of [ab] into a
equally sized parts and S is a set intersecting each Fis in exactly one element. Here we extend this definition
to sets S intersecting each Fis in at most one element, by setting

M·,S∩Fis if S ∩ Fi =
6 ∅
(vS,F s ,F o (M ))F o =
i 0 if S ∩ Fi = ∅

for each 1 ≤ i ≤ a. We now can state the approximate Markov transition guarantees for the Bernoulli
rotations step of k- PDS - TO - SEMI - CR in this notation.

124
Algorithm k- PDS - TO - SEMI - CR
Inputs: k- PDS instance G ∈ GN with dense subgraph size k that divides N , and the following parameters

• partition E, edge probabilities 0 < q < p ≤ 1, Q ∈ (0, 1) and m as in Figure 12


3t −1
• refinement parameter s and number of vertices n = 3ks · 2 for some t ∈ N satisfy that
m ≤ 3t ks ≤ n ≤ poly(N )

• mean parameter µ ∈ (0, 1) as in Figure 12

1. Symmetrize and Plant Diagonals: Compute MPD1 ∈ {0, 1}m×m and F as in Step 1 of Figure 12.

2. Pad and Further Partition: Form MPD2 and F 0 as in Step 2 of Figure 12 modified so that MPD2 is
a 3t ks × 3t ks matrix and each Fi0 has size 3t s. Let F s be the partition of [3t ks] into ks parts of
size 3t by refining F 0 by splitting each of its parts into s parts of equal size arbitrarily.

3. Bernoulli Rotations: Let F o be a partition of [n] into ks equally sized parts. Now compute the
matrix MR ∈ Rn×n as follows:

(1) For each i, j ∈ [k], apply T ENSOR -B ERN -ROTATIONS to the matrix (MP )Fis ,Fjs with matrix
parameter A1 = A2 =pK3,t , Bernoulli probabilities 0 < Q < p ≤ 1, output dimension
1 t
2 (3 − 1), λ1 = λ2 = 3/2 and mean parameter µ.
(2) Set the entries of (MR )Fio ,Fjo to be the entries in order of the matrix output in (1).

4. Threshold and Output: Output the graph generated by Step 4 of Figure 12 modified so that G0 has
µ
vertex set [n] and MR is thresholded at 3t+1 .

Figure 14: Reduction from k-partite planted dense subgraph to semirandom community recovery.

Lemma 14.14 (Bernoulli Rotations for SEMI - CR). Let F s and F o be a fixed partitions of [3t ks] and [n] into
ks parts of size 3t and 12 (3t − 1), respectively, and let S ⊆ [3t ks] where |S| = k and |S ∩ Fis | ≤ 1 for each
1 ≤ i ≤ ks. Let A3 denote Step 3 of k- PDS - TO - SEMI - CR with input MPD2 and output MR . Suppose that
p, Q and µ are as in Theorem 14.13, then it follows that
 
dTV A3 M[3t ks]×[3t ks] (S × S, Bern(p), Bern(Q)) ,
 
2µ > ⊗n×n
= O (3t ks)−1

L · vS,F s ,F o (K3,t )vS,F s ,F o (K3,t ) + N (0, 1)
3
  t t
 
dTV A3 Bern(Q)⊗3 ks×3 ks , N (0, 1)⊗n×n = O (3t ks)−1


Proof. Let (1) and (2) denote the following two cases:

1. MPD2 ∼ M[3t ks]×[3t ks] (S × S, Bern(p), Bern(Q)); and


t ks×3t ks
2. MPD2 ∼ Bern(Q)⊗3 .

125
Now define the matrix MR0 with independent entries such that
 2µ > ⊗n×n if (1) holds
3 · vS,F ,F (K3,t )vS,F ,F (K3,t ) + N (0, 1)
s o s o
0
MR ∼
N (0, 1)⊗n×n if (2) holds
Similarly to Lemma 14.10, Lemmas 8.1 and 8.5 yields that under both (1) and (2), we have that
 
dTV (MR )F s ,F s , MR0 F s ,F s = O 32t · (3t ks)−3
 
i j i j

for all 1 ≤ i, j ≤ ks. The tensorization property of total variation in Fact 6.2 now yields that
dTV L(MR ), L(MR0 ) = O (3t ks)−1
 

under both (1) and (2), proving the lemma.


We now complete the proof of Theorem 14.13, which follows a similar structure as in Theorem 14.2.
Proof of Theorem 14.13. Let the steps of A to map inputs to outputs as follows
A A A A
1
(G, E) −−→ 2
(MPD1 , F ) −−→ (MPD2 , F s ) −−→
3
(MR , F o ) −→
4
G0
Under H1 , consider Lemma 6.3 applied to the following sequence of distributions
P0 = GE (N, k, p, q)
P1 = M[m]×[m] (S × S, Bern(p), Bern(Q)) where S ∼ Um (F )
P2 = M[3t ks]×[3t ks] (S × S, Bern(p), Bern(Q)) where S ∼ U3kt ks (F s )

P3 = · vS,F s ,F o (K3,t )vS,F s ,F o (K3,t )> + N (0, 1)⊗n×n where S ∼ U3kt ks (F s )
3
P4 = TSI(n, K, K/2, 1/2, µ1 , µ2 , µ3 )
n o
Q
Let CQ = max 1−Q , 1−Q
Q and consider setting
 r
Q2 N 2 CQ k 2

, 2 = 0, 3 = O (3t ks)−1

1 = 4k · exp − + and 4 = 0
48pkm 2m
Lemma 7.5 implies this is a valid choice of 1 , A2 is exact so we can take 2 = 0 and 3 is valid by applying
Lemma 14.14 and averaging over S ∼ U3kt ks (F s ) using the conditioning property of total variation in Fact
6.2. Now note that for each S the definition of vS,F s ,F o (K3,t ) implies that there are sets S1 and S2 with
t
|S1 | = (3t − 1)k and |S2 | = 3 2−1 · k such that
µ/3t if i, j ∈ S1


t
  
2µ µ −µ/3 if (i, j) ∈ S1 × S2 or (i, j) ∈ S2 × S1

· vS,F s ,F o (K3,t )vS,F s ,F o (K3,t )> = t+1 +
3 ij 3 
 0 if i, j ∈ S2

−µ/3 t+1 if i, j 6∈ (S1 ∪ S2 )
for each 1 ≤ i, j ≤ n. Permuting the rows and columns of P3 therefore yields P4 exactly with 4 = 0.
Lemma 6.3 thus establishes the first bound. Under H0 , consider the distributions
t ks×3t ks
P0 = G(N, q), P1 = Bern(Q)⊗m×m , P2 = Bern(Q)⊗3 ,
⊗n×n
P3 = N (0, 1) and P4 = G (n, 1/2 − µ1 )
 
Q2 N 2
As in Theorems 14.2 and 14.9, Lemmas 7.5 and 14.14 imply 1 = 4k · exp − 48pkm and the choices
of 2 , 3 and 4 above are valid. Lemma 6.3 now yields the second bound and completes the proof of the
theorem.

126
We now add a simple final step to k PDS - TO - SEMI - CR, reducing to arbitrary P0 6= 1/2. The guarantees
for this modified reduction are captured in the following corollary.
Corollary 14.15 (Arbitrary Bounded P0 ). Define all parameters as in Theorem 14.13 and let P0 ∈ (0, 1)
be such that η = min{P0 , 1 − P0 } = Ω(1). Then there is a poly(N ) time reduction A from graphs on N
vertices to graphs on n vertices satisfying that
 
k −Ω(N 2 /km) t −1
dTV (A (GE (N, k, p, q)) , TSI(n, K, K/2, P0 , 2ηµ1 , 2ηµ2 , 2ηµ3 )) = O √ + e + (3 ks)
N
 2

dTV (A (G(N, q)) , G (n, P0 − 2ηµ1 )) = O e−Ω(N /km) + (3t ks)−1

Proof. This corollary follows from the same reduction in the first part of the proof of Corollary 14.5. Con-
sider the reduction A that adds a simple post-processing step to k- PDS - TO - SEMI - CR as follows. On input
graph G with N vertices:
1. Form the graph G1 by applying k- PDS - TO - CR to G with parameters N, k, E, `, n, s, t and µ.

2. Form G2 as in A2 of Corollary 14.5.


This clearly runs in poly(N ) time and the second step can be verified to map TSI(n, K, K/2, 1/2, µ1 , µ2 , µ3 )
to TSI(n, K, K/2, P0 , 2ηµ1 , 2ηµ2 , 2ηµ3 ) and G (n, 1/2 − µ1 ) to G (n, P0 − 2ηµ1 ) exactly. Applying The-
orem 14.13 and Lemma 6.3 to each of these two steps proves the bounds in the corollary statement.

Summarizing the results of this section, we arrive at the desired computational lower bound for SEMI - CR.
The proof of the next theorem follows the usual recipe for deducing computational lower bounds and is
deferred to Appendix B.2.

Theorem 3.4 (Lower Bounds for SEMI - CR). If k and n are polynomial in each other with k = Ω( n) and
0 < P0 < P1 ≤ 1 where min{P0 , 1 − P0 } = Ω(1), then the k- PC conjecture or k- PDS conjecture for
constant 0 < q < p ≤ 1 both imply that there is a computational lower bound for SEMI - CR(n, k, P1 , P0 ) at
(P1 −P0 )2 2
P0 (1−P0 ) = õ(n/k ).

15 Tensor Principal Component Analysis


In this section, we: (1) give our reduction k- PST- TO - TPCA from k-partite planted sub-tensor to tensor PCA;
(2) combine this with the completing hypergraphs technique of Section 11 to prove our main computational
lower bound for the hypothesis testing formulation of tensor PCA, Theorem 3.8; and (3) we show that
Theorem 3.8 implies computational lower bounds for the recovery formulation of tensor PCA. We remark
that the heuristic at the end of Section 4.3 yields the predicted computational barrier for TPCA. Specifically,
the `2 norm for the data tensor E[X] corresponding to k- HPCs is Θ(k s/2 ) which is Θ̃(ns/4 ) just below the
conjectured computational barrier for k- HPCs . Furthermore, the corresponding `2 norm for H1 of TPCAs is
Θ̃(θns/2 ). Equating these norms correctly predicts the computational barrier of θ = Θ̃(n−s/4 ).
Our reduction k- PST- TO - TPCA is shown in Figure 15, which applies dense Bernoulli rotations with Kro-
necker products of the matrices K2,t to the planted sub-tensor problem. The following theorem establishes
the approximate Markov transition properties of this reduction. Its proof is similar to the proofs of Theorems
10.2 and 14.2. We omit details that are similar for brevity.
Theorem 15.1 (Reduction to Tensor PCA). Fix initial and target parameters as follows:
• Initial k- PST Parameters: dimension N , sub-tensor size k that divides N , order s, a partition F of
[N ] into k parts of size N/k and edge probabilities 0 < q < p ≤ 1 where min{q, 1 − q} = ΩN (1).

127
Algorithm k- PST- TO - TPCA
⊗s
Inputs: k- PST instance T ∈ {0, 1}N of order s with planted sub-tensor size k that divides N , and the
following parameters

• partition F of [N ] into k parts of size N/k and edge probabilities 0 < q < p ≤ 1

• output dimension n and a parameter t ∈ N satisfying that

n ≤ D = 2k(2t − 1), N ≤ 2t k and t = O(log N )

• target level of signal θ ∈ (0, 1) where

c·δ
θ≤ p
· t + log(p − q)−1
2st/2
n    o
1−q
for a sufficiently small constant c > 0, where δ = min log pq , log 1−p .

t t
1. Pad: Form TPD ∈ {0, 1}2 k×2 k by embedding T as the upper left principal sub-tensor of TPD
and then adding 2t k − N new indices along each axis of T and filling all missing entries with
i.i.d. samples from Bern(q). Let Fi0 be Fi with 2t − N/k of the new indices. Sample k random
permutations σi of Fi0 independently for each 1 ≤ i ≤ k and permute the indices along each axis
of TPD within each part Fi0 according to σi .

2. Bernoulli Rotations: Let F 00 be a partition of [D] into k equally sized parts. Now compute the
⊗s
matrix TR ∈ RK as follows:

(1) For each block index (i1 , i2 , . . . , is ) ∈ [k], apply T ENSOR -B ERN -ROTATIONS to the tensor
(TPD )Fi0 ,Fi0 ,...,Fi0 with matrix parameters A1 = A2 = · · · = As = K2,t , rejection kernel
1 2 s
parameter RRK = (2t k)s , Bernoulli probabilities 0 < Q < p ≤ 1, output
√ dimension D/k =
2(2t − 1), singular value upper bounds λ1 = λ2 = · · · = λs = 2 and mean parameter
µ = θ · 2s(t+1)/2 .
(2) Set the entries of (TR )Fi00 ,Fi00 ,...,Fi00 to be the entries in order of the tensor output in (1).
1 2 s

3. Subsample, Sign and Output: Randomly choose a subset U ⊆ [D] of size |U | = n and randomly
sample a vector b ∼ Unif [{−1, 1}]⊗D output the tensor b⊗s TR restricted to the indices in U , or
in other words (b⊗s TR )U,U,...,U , where denotes the entrywise product of two tensors.

Figure 15: Reduction from k-partite Bernoulli planted sub-tensor to tensor PCA.

• Target TPCA Parameters: dimension n and a parameter t = t(N ) ∈ N satisfying that


n ≤ D = 2k(2t − 1), N ≤ 2t k and t = O(log N )
and target level of signal θ ∈ (0, 1) where
c·δ
θ≤ p
2st/2 · t + log(p − q)−1

128
n    o
1−q
for a sufficiently small constant c > 0, where δ = min log pq , log 1−p .

Let A(T ) denote k- PST- TO - TPCA applied to the tensor T with these parameters. Then A runs in poly(N )
time and it follows that
dTV A M[N ]s (S s , Bern(p), Bern(q)) , TPCAsD (n, θ) = O k −2s 2−2st
  
 ⊗s

dTV A M[N ]s (Bern(q)) , N (0, 1)⊗n = O k −2s 2−2st
 

for any set S ⊆ [N ] with |S ∩ Ei | = 1 for each 1 ≤ i ≤ k.


We now prove two lemmas stating the guarantees for the dense Bernoulli rotations step and final step
of k- PST- TO - TPCA. Define vS,F 0 ,F 00 (M ) as in Section 10.1. Note that the matrix K2,t has dimensions
2(2t − 1) × 2t . The proof of the next lemma follows from the same argument as in the proof of Lemma 10.3.
Lemma 15.2 (Bernoulli Rotations for TPCA). Let F 0 and F 00 be fixed partitions of [2t k] and [D] into k parts
of size 2t and 2(2t − 1), respectively, and let S ⊆ [2t k] where |S ∩ Fi0 | = 1 for each 1 ≤ i ≤ k. Let A2
denote Step 2 of k- PST- TO - TPCA with input TPD and output TR . Suppose that p, q and θ are as in Theorem
15.1, then it follows that

dTV A2 M[2t k]s (S s , Bern(p), Bern(q)) ,

 ⊗s

L 2st/2 θ · vS,F 0 ,F 00 (K2,t )⊗s + N (0, 1)⊗D = O k −2s 2−2st

 ⊗s

dTV A2 M[2t k]s (Bern(q)) , N (0, 1)⊗D = O k −2s 2−2st
 

Proof. This lemma follows from the same argument as in the proof of Lemma 10.3. We outline the details
that differ. Specifically, consider the case in which TPD ∼ M[2t k]s (S s , Bern(p), Bern(q)). Observe that
(TPD2 )Fi0 ,Fi0 ,...,Fi0 ∼ PB Fi01 × Fi02 × · · · × Fi0s , (S ∩ Fi01 , S ∩ Fi02 , . . . , S ∩ Fi0s ), p, q

1 2 s

for all (i1 , i2 , . . . , is ) ∈ [k]s . The singular value upper bound on K2,t in Lemma 8.5 and the same application
of Corollary 8.2 as in Lemma 10.3 yields that
  ⊗s

dTV (TR )Fi00 ,...,Fi00 , L 2−s/2 µ · (K2,t )·,S∩Fi0 ⊗ · · · ⊗ (K2,t )·,S∩Fi0 + N (0, 1)⊗(D/k) = O k −3s 2−2st

1 s 1 s
Qs
for all (i1 , i2 , . . . , is ) ∈ [k]s since j=1 λj = 2s/2 . Note that the exponent of 8 is guaranteed by changing
the parameter in Gaussian rejection kernels from n to n10 to decrease their total variation error. Note that
this step still runs in poly(n10 ) time. Combining this bound for all such (i1 , i2 , . . . , is ) and the tensorization
property of total variation in Fact 6.2 yields that
  ⊗s

dTV TR , L 2−s/2 µ · vS,F 0 ,F 00 (K2,t )⊗s + N (0, 1)⊗D = O k −2s 2−2st


Combining this with the fact that µ = θ · 2s(t+1)/2 now yields the first bound in the lemma. The second
bound follows by the same argument but now applying Corollary 8.2 to the distribution (TPD2 )Fi0 ,...,Fi0 ∼
1 s
⊗s
Bern(q)(D/k) . This completes the proof of the lemma.

Lemma 15.3 (Signing for TPCA). Let F 0 , F 00 and S be as in Lemma 15.2 and let p, q and θ be as in Theorem
15.1. Let A3 denote Step 3 of k- PST- TO - TPCA with input TR and output given by the output T 0 of A. Then
 ⊗s

A3 2st/2 θ · vS,F 0 ,F 00 (K2,t )⊗s + N (0, 1)⊗D ∼ TPCAsD (n, θ)
 ⊗s
 ⊗s
A3 N (0, 1)⊗D ∼ N (0, 1)⊗n

129
 ⊗s

Proof. Suppose that TR ∼ L 2st/2 θ · vS,F 0 ,F 00 (K2,t )⊗s + N (0, 1)⊗D and let b ∼ Unif [{−1, 1}]⊗D
be as in Step 3 of A. The symmetry of zero-mean Gaussians and independence among the entries of
⊗s
N (0, 1)⊗D imply that
 ⊗s
  ⊗s

b⊗s TR ∼ L 2st/2 θ · u⊗s + b⊗s N (0, 1)⊗D = L 2st/2 θ · u⊗s + N (0, 1)⊗D

⊗s
where u = b vS,F 0 ,F 00 (K2,t ) and the two terms u⊗s and N (0, 1)⊗D above are independent. Now
note that each entry of vS,F 0 ,F 00 (K2,t ) is either ±2−t/2 by the definition of K2,t . This implies that 2t/2 u is
distributed as Unif [{−1, 1}]⊗D and hence that
 ⊗s

L b⊗s TR = L θ · b⊗s + N (0, 1)⊗D = TPCAsD (D, θ)


Subsampling the same set U of n coordinates of this tensor along each axis by definition yields TPCA(n, θ),
⊗s
proving the first claim in the lemma. The second claim is immediate by the fact that if TR ∼ N (0, 1)⊗D
⊗s
then it also holds that b⊗s TR ∼ N (0, 1)⊗D . This completes the proof of the lemma.

We now complete the proof of Theorem 15.1 by applying Lemma 6.3 as in Theorems 10.2 and 14.2.

Proof of Theorem 15.1. Define the steps of A to map inputs to outputs as follows
A A A
1
(T, F ) −−→ 2
(TPD , F ) −−→ (TR , F 00 ) −→
3
T0

Consider Lemma 6.3 applied to the following sequence of distributions

P0 = M[N ]s (S s , Bern(p), Bern(q))


P1 = M[2t k]s (S s , Bern(p), Bern(q)) where S ∼ U2t k (F 0 )
⊗s
P2 = 2st/2 θ · vS,F 0 ,F 00 (K2,t )⊗s + N (0, 1)⊗D where S ∼ U2t k (F 0 )
P3 = TPCAsD (n, θ)

Consider applying Lemmas 15.2 and 15.3 while averaging over S ∼ U2t k (F 0 ) and applying the conditioning
property of total variation in Fact 6.2. This yields that we may take 1 = 0, 2 = O k −2s 2−2st and


3 = 0. Applying Lemma 6.3 proves the first bound in the theorem. Now consider the following sequence
of distributions
⊗s ⊗s
P0 = M[N ]s (Bern(q)) , P2 = N (0, 1)⊗D
P1 = M[2t k]s (Bern(q)) , and P3 = N (0, 1)⊗n

Lemmas 15.2 and 15.3 imply we can again take 1 = 0, 2 = O k −2s 2−2st and 3 = 0. The second bound


in the theorem now follows from Lemma 6.3.

We now apply this theorem to deduce our main computational lower bounds for tensor PCA by verifying
its guarantees are sufficient to apply Lemma 11.3.

Theorem 3.8 (Lower Bounds for TPCA). Let n be a parameter and s ≥ 3 be a constant, then the k- HPCs or
k- HPDSs conjecture for constant 0 < q < p ≤ 1 both imply a computational lower bound for TPCAs (n, θ)
at all levels of signal θ = õ(n−s/4 ) against poly(n) time algorithms A solving TPCAs (n, θ) with a low false
positive probability of PH0 [A(T ) = H1 ] = O(n−s ).

130
Proof. We will verify that the approximate Markov transition guarantees for k- PST- TO - TPCA in Theorem
15.1 are sufficient to apply Lemma 11.3 for the set of P = TPCAs (n, θ) with parameters (n, θ) that fill out
the region θ = õ(n−s/4 ). Fix a constant pair of probabilities 0 < Q < p ≤ 1, a constant positive integer s
and any sequence of parameters (n, θ) where θ ∈ (0, 1) is implicitly a function of n with
c
θ≤ √
ws/2 ns/4 log n
for sufficiently large n, an arbitrarily slow-growing function w = w(n) → ∞ and a sufficiently small
constant c > 0. Now consider the parameters (N, k) and input t to k- PST- TO - TPCA defined as follows:

• let t be such that 2t is the smallest power of two greater than w n; and

• let k = dw−1 ne and let N be the largest multiple of k less than n.

Now observe that these choices of parameters ensure that k divides N , it holds that k = o( N ) and

N ≤ n ≤ 2t k ≤ D = 2k(2t − 1)

Furthermore, we have that N = Θ(n) and 2t = Θ(w n). For a sufficiently small choice of c > 0, we also
have that
c c0 · δ
θ ≤ s/2 s/4 √ ≤ p
w n log n 2st/2 · t + log(p − Q)−1
where c0 > 0 is the constant and δ is as in Theorem 15.1. This verifies all of the conditions needed to
apply Theorem 15.1, which implies that k- PST- TO - TPCA maps s (N, k, p, Q) to TPCA s (n, θ) under
−2s −2st
 k- PST−2s
E
both H0 and H1 to within total variation error O k 2 = O(n ). By Lemma 11.3, the k- HPDSs
conjecture for k- HPDSsE (N 0 , k 0 , p, q) where N = N 0 − (s − 1)N 0 /k 0 and k = k 0 − s + 1 now implies
that there are is no poly(n) time algorithm A solving TPCAs (n, θ) with a low false positive probability of
PH0 [A(T ) = H1 ] = O(n−s ). This completes the proof of the theorem.

We conclude this section with the following lemma observing that this theorem implies a computational
lower bound for estimating v in TPCAs (n, θ) where θ = ω̃(n−s/2 ) and θ = õ(n−s/4 ). Note that the
requirement θ = ω̃(n−s/2 ) is weaker than the condition θ = ω̃(n(1−s)/2 ), which is necessary for recovering
v to be information-theoretically possible, as discussed in Section 3.8. The next lemma shows that any
estimator yields a test in the hypothesis testing formulation of tensor PCA that must have a low false positive
probability of error, since thresholding hv̂, T i where v̂ is an estimator of v, yields a means to distinguish
H0 and H1 with high probability. We remark that the requirement hv, v̂i = Ω(kvk2 ) is weaker than the
√ √
condition kv − v̂ · nk2 = o( n) when v̂ is a unit vector and v ∈ {−1, 1}n . Thus any estimation algorithm

with `2 error o( n), directly yields an algorithm AE satisfying the conditions of the lemma.

Lemma 15.4 (One-Sided Blackboxes from Estimation in Tensor PCA). Let s ≥ 2 be a fixed constant and
⊗s
suppose that there is a poly(n) time algorithm AE that, on input
√ sampled from θv ⊗s + N (0, 1)⊗n where
v ∈ {−1, 1}n is fixed but unknown to AE and θ = ω(n−s/2 s log n), outputs a unit vector v̂ ∈ Rn with
hv, v̂i = Ω(kvk2 ). Then there is a poly(n) time algorithm AD solving TPCAs (n, θ) with a low false positive
probability of PH0 [AD (T ) = H1 ] = O(n−s ).

Proof. Let T be an instance of TPCAs (n, θ) with T = θv ⊗s + G under H1 and T = G under H0 where
⊗s
G ∼ N (0, 1)⊗n . Consider the following algorithm AD for TPCAs (n, θ):
⊗s
1. Independently sample G0 ∼ N (0, 1)⊗n and form T1 = √1 (T
2
+ G0 ) and T2 = √1 (T
2
− G0 ).

2. Compute v̂(T1 ) as the output of AE applied to T1 .

131

3. Output H0 if hv̂(T1 )⊗s , T2 i < 2 s log n and output H1 otherwise.

First note that the entries of √12 (G + G0 ) and √12 (G − G0 ) are jointly Gaussian but uncorrelated, which
implies that these two tensors are independent. This implies that T1 and T2 are independent. Since v̂(T1 ) is
a unit vector and independent of T2 , it follows that hv̂(T1 )⊗s , T2 i is distributed as N (0, 1) conditioned on
v̂(T1 ) if T is distributed according to H0 of TPCAs (n, θ). Now we have that
h i
PH0 [AD (T ) = H1 ] = P hv̂(T1 )⊗s , T2 i ≥ 2 s log n = O(n−2s )
p

where the second equality follows from standard Gaussian tail bounds. If T is distributed according to
H1 , then hv̂(T1 )⊗s , T2 i ∼ N (θhv̂(T1 ), vis , 1).
√ In this case, AE ensures that hv̂(T1 ), vis = Ω(ns/2 ) since

kvk2 = n, and therefore θhv̂(T1 ), vis = ω( s log n). It therefore follows that
h i
⊗s
PH1 [AD (T ) = H0 ] ≤ P hv̂(T1 ) , T2 i − θhv̂(T1 ), vi < −2 s log n = O(n−2s )
s
p

Thus AD has Type I+II error that is o(1) and the desired low false positive probability, which completes the
proof of the lemma.

16 Universality of Lower Bounds for Learning Sparse Mixtures


In this section, we combine our reduction to ISGM from Section 10.1 with symmetric 3-ary rejection kernels,
which were introduced and analyzed in Section 7.3. We remark that the k-partite promise in k- PDS is
crucially used in our reduction to obtain this universality. In particular, this promise ensures that the entries
of the intermediate ISGM instance are from one of three distinct distributions, when conditioned on the part
of the mixture the sample is from. This is necessary for our application of symmetric 3-ary rejection kernels.
An overview of the ideas in this section can be found in Section 4.7.
Our general lower bound holds given tail bounds on the likelihood ratios between the planted and noise
distributions, and applies to a wide range of natural distributional formulations of learning sparse mixtures.
For example, our general lower bound recovers the tight computational lower bounds for sparse PCA in the
spiked covariance model from [GMZ17, BBH18, BB19b]. The results in this section can also be interpreted
as a universality principle for computational lower bounds in sparse PCA. We prove the approximate Markov
transition guarantees for our reduction to GLSM in Section 16.1 and discuss the universality conditions
needed for our lower bounds in Section 16.2.

16.1 Reduction to Generalized Learning Sparse Mixtures


In this section, we combine symmetric 3-ary rejection kernels with the reduction k- BPDS - TO - ISGM to map
from k- BPDS to generalized sparse mixtures. The details of this reduction k- BPDS - TO - GLSM are shown
in Figure 16. As mentioned in Sections 4.7 and 7.3, to reduce to sparse mixtures near their computational
barrier, it is crucial to produce multiple planted distributions. Previous rejection kernels do not have enough
degrees of freedom to map to three output distributions given their binary inputs. The symmetric 3-ary
rejection kernels introduced in Section 7.3 overcome this issue by mapping three input to three output
distributions. In particular, we will see in this section that their approximate Markov transition guarantees
established in Lemma 7.7 exactly lead to tight computational lower bounds for GLSM. Throughout this
section, we will adopt the definitions of GLSM and GLSMD introduced in Sections 3.9 and 6.3.
In order to establish computational lower bounds for GLSM, it is crucial to define a meaningful notion
of the level of signal in a set of target distributions D, Q and {Pν }ν∈R . This level of signal was defined in
Section 3.9 and is reviewed below for convenience. We remark that this definition will turn out to coincide

132
Algorithm k- BPDS - TO - GLSM
Inputs: Matrix M ∈ {0, 1}m×n , dense subgraph dimensions km and kn where kn divides n and the
following parameters

• partition F , edge probabilities 0 < q < p ≤ 1 and w(n) as in Figure 9

• target GLSM parameters (N, km , d) satisfying wN ≤ n and m ≤ d, a mixture distribution D and


target distributions {Pν }ν∈R and Q

1. Map to Gaussian Sparse Mixtures: Form the sample Z1 , Z2 , . . . , ZN ∈ Rd by setting

(Z1 , Z2 , . . . , ZN ) ← k- BPDS - TO - ISGM(M, F )

where k- BPDS - TO - ISGM is applied with r = 2, slow-growing


q function w(n), t = dlog2 (n/kn )e,
kn
target parameters (N, km , d),  = 1/2 and µ = c1 n log n for a sufficiently small constant c1 > 0.

2. Truncate and 3-ary Rejection Kernels: Sample ν1 , ν2 , . . . , νN ∼i.i.d. D, truncate the νi to lie within
[−1, 1] and form the vectors X1 , X2 , . . . , XN ∈ Rd by setting

Xij ← 3- SRK(TRτ (Zij ), Pνi , P−νi , Q)

for each i ∈ [N ] and j ∈ [d]. Here 3- SRK is applied with Nit = d4 log(dN )e iterations and with
the parameters
1
a = Φ(τ ) − Φ(−τ ), µ1 = (Φ(τ + µ) − Φ(τ − µ)) ,
2
1
µ2 = (2 · Φ(τ ) − Φ(τ + µ) − Φ(τ − µ))
2

3. Output: The vectors (X1 , X2 , . . . , XN ).

Figure 16: Reduction from k-part bipartite planted dense subgraph to general learning sparse mixtures.

with the conditions needed to apply symmetric 3-ary rejection kernels. This notion of signal also implicitly
defines the universality class over which our computational lower bounds hold.

Definition 3.9 (Universal Class and Level of Signal). Given a parameter N , define the collection of distri-
butions U = (D, Q, {Pν }ν∈R ) implicitly parameterized by N to be in the universality class UC(N ) if

• the pairs (Pν , Q) are all computable pairs, as in Definition 7.6, for all ν ∈ R;

• D is a symmetric distribution about zero and Pν∼D [ν ∈ [−1, 1]] = 1 − o(N −1 ); and

• there is a level of signal τU ∈ R such that for all ν ∈ [−1, 1] such that for any fixed constant K > 0,
it holds that

dPν dP−ν dPν dP−ν 2

dQ (x) − dQ (x) = ON (τU ) and dQ (x) + dQ (x) − 2 = ON τU

133
with probability at least 1 − O N −K over each of Pν , P−ν and Q.


In our reduction k- BPDS - TO - ISGM, we truncate Gaussians to generate the input distributions Tern. In
Figure 16, TRτ : R → {−1, 0, 1} denotes the truncation map given by

 1 if x > |τ |
TR τ (x) = 0 if − |τ | ≤ x ≤ |τ |
−1 if x < −|τ |

The following simple lemma on truncating symmetric triples of Gaussian distributions will be important in
the proofs in this section. Its proof is a direct computation and is deferred to Appendix B.2.
Lemma 16.1 (Truncating Gaussians). Let τ > 0 be constant, µ > 0 be tending to zero and let a, µ1 , µ2 be
such that
TR τ (N (µ, 1)) ∼ Tern(a, µ1 , µ2 )
TR τ (N (−µ, 1)) ∼ Tern(a, −µ1 , µ2 )
TRτ (N (0, 1)) ∼ Tern(a, 0, 0)
Then it follows that a > 0 is constant, 0 < µ1 = Θ(µ) and 0 < µ2 = Θ(µ2 ).
We now will prove our main approximate Markov transition guarantees for k- BPDS - TO - GLSM. The
proof follows from combining Theorem 10.2, Lemma 7.7 and an application of tensorization of dTV .
Theorem 16.2 (Reduction from k- BPDS to GLSM). Let n be a parameter and w(n) = ω(1) be a slow-
growing function. Fix initial and target parameters as follows:
• Initial k- BPDS Parameters: vertex counts on each side m and n that are polynomial in one another,
dense subgraph dimensions km and kn where kn divides n, constant edge probabilities 0 < q < p ≤ 1
and a partition F of [n].
0
• Target GLSM Parameters: (N, d) satisfying wN ≤ n, N ≥ nc for some constant c0 > 0 and m ≤
d ≤ poly(n), target distribution collection U = (D, Q, {Pν }ν∈R ) ∈ UC(N ) satisyfing that
s
kn
0 < τU ≤ c ·
n log n
for a sufficiently small constant c > 0.
Let A(M ) denote k- BPDS - TO - GLSM applied to the adjacency matrix M with these parameters. Then A
runs in poly(m, n) time and it follows that
dTV A M[m]×[n] (S × T, p, q) , GLSMD (N, S, d, U) = o(1) + O w−1 + kn−2 m−2 r−2t + n−2 + N −3 d−3
  
 
dTV A Bern(q)⊗m×n , Q⊗d×N = O kn−2 m−2 r−2t + n−2 + N −3 d−3
 

for all subsets S ⊆ [m] with |S| = km and subsets T ⊆ [n] with |T | = kn and |T ∩ Fi | = 1 for each
1 ≤ i ≤ kn .
Proof. Let A1 denote Step 1 of A with input M and output (Z1 , Z2 , . . . , ZN ). First note that 2t = Θ(n/kn )
by the definition of t and log m = Θ(log n) since m and n are polynomial in one another. Thus for a small
enough choice of c1 > 0, we have
s
2−(t+1)/2
    
kn p 1−q
µ = c1 · ≤ p · min log , log
n log n 2 6 log(kn m · 2t ) + 2 log(p − q)−1 q 1−p

134
since p and q are constants. Therefore µ satisfies the conditions needed to apply Theorem 10.2 to A1 . Now
let A2 denote Step 2 of A with input (Z1 , Z2 , . . . , ZN ) and output (X1 , X2 , . . . , XN ). First suppose that
(Z1 , Z2 , . . . , ZN ) ∼ ISGMD (N, S, d, µ, 1/2) or in other words where

Zi ∼i.i.d. MIX1/2 (N (µ · 1S , Id ), N (−µ · 1S , Id ))

For the next part of this argument, we condition on: (1) the entire vector ν = (ν1 , ν2 , . . . , νN ); and (2)
the subset P ⊆ [N ] of sample indices corresponding to the positive part N (µ · 1S , Id ) of the mixture. Let
C(ν, P ) denote the event corresponding to this conditioning. After truncating according to TRτ , by Lemma
16.1 the resulting entries are distributed as

 Tern(a, µ1 , µ2 ) if (i, j) ∈ S × P
TR τ (Zij ) ∼ Tern(a, −µ1 , µ2 ) if (i, j) ∈ S × P C
if i 6∈ S

Tern(a, 0, 0)

Furthermore, these entries are all independent conditioned on (ν, P ). Since τ is constant, Lemma 16.1 also
implies that a ∈ (0, 1) is constant, µ1 = Θ(µ) and µ2 = Θ(µ2 ). Let Sν be
 
dPν dP−ν 2|µ2 | dPν dP−ν
Sν = x ∈ X : 2|µ1 | ≥ (x) − (x) and
≥ (x) + (x) − 2
dQ dQ max{a, 1 − a} dQ dQ
as in Lemma 7.7. Since U = (D, Q, {Pν }ν∈R ) ∈ UC(N ) has level of signal τU ≤ c0 · µ for a sufficiently
small constant c0 > 0, we have by definition that {x ∈ Sνi } occurs with probability at least 1 − δ1 where
δ1 = O(n−4−K1 ) over each of Pνi , P−νi and Q, where K1 > 0 is a constant for which d = O(nK1 ). Here,
0
we are implicitly using the fact that N ≥ nc for some constant c0 > 0.
−1
√Now consider applying
−1
Lemma 7.7 to each application p of 3- SRK in Step 2 of A. Note that |µ1 | =
O( n log n) and |µ2 | = O(n log n) since µ = Ω( kn /n log n) and kn ≥ 1. Now consider the d-
dimensional vectors X10 , X20 , . . . , XN0 with independent entries distributed as


 Pνi if (i, j) ∈ S × P
0
Xij ∼ P if (i, j) ∈ S × P C
 −νi
Q if i 6∈ S
The tensorization property of dTV from Fact 6.2 implies that

dTV L(X1 , X2 , . . . , XN |ν, P ), L(X10 , X20 , . . . , XN


0

|ν, P )
N X
X d
dTV L(Xij |ν, P ), L(Xij0 |ν, P )


i=1 j=1
N X
X d
dTV 3- SRK(TRτ (Zij ), Pνi , P−νi , Q), L(Xij0 |ν, P )


i=1 j=1
"  Nit #
1
≤ N d 2δ1 1 + |µ1 |−1 + |µ2 | −1
+ δ1 1 + |µ1 |−1 + |µ2 |−1
 
+
2
= O n−2 + N −3 d−3


since N ≤ n, δ1 = O(n−4 d−1 ), Nit = d4 log(dN )e and by the total variation upper bounds in Lemma 7.7.
0 [N ]

We now will drop the conditioning on (ν, P ) and average over ν ∼ D and P ∼ Unif 2 . Observe
that, when not conditioned on (ν, P ), it holds that

(X10 , X20 , . . . , XN
0
) ∼ GLSMD N, S, d, D0 , Q, {Pν }ν∈R


135
where D0 is D conditioned to lie in [−1, 1]. Note that here we used the fact that D and therefore D0 is
symmetric about zero. Coupling the latent ν1 , ν2 , . . . , νN sampled from D and D0 and then applying the
tensorization property of Fact 6.2 yields that
dTV GLSMD N, S, d, D0 , Q, {Pν }ν∈R , GLSMD (N, S, d, (D, Q, {Pν }ν∈R ))
 

≤ dTV (D⊗n , D0⊗n ) ≤ N · dTV (D, D0 ) ≤ N · o(N −1 ) = o(1)


where dTV (D, D0 ) = o(N −1 ) follow from the conditioning property of dTV from Fact 6.2 and the fact that
Pν∼D [ν ∈ [−1, 1]] = 1 − o(N −1 ). The triangle inequality and conditioning property of dTV in Fact 6.2 now
imply that
dTV (A2 (ISGMD (N, S, d, µ, 1/2)) , GLSMD (N, S, d, U))
≤ dTV L(X1 , X2 , . . . , XN ), L(X10 , X20 , . . . , XN
0
) + dTV L(X10 , X20 , . . . , XN 0
 
), GLSMD (N, S, d, U)
≤ Eν∼D0 EP ∼Unif[2[N ] ] dTV L(X1 , X2 , . . . , XN |ν, P ), L(X10 , X20 , . . . , XN0

|ν, P )
+ dTV GLSMD N, S, d, D0 , Q, {Pν }ν∈R , GLSMD (N, S, d, U)
 

= o(1) + O n−2 + N −3 d−3




Now consider the case when Z1 , Z2 , . . . , ZN ∼i.i.d. N (0, Id ). Repeating the argument above with S = ∅
0 ) ∼ Q⊗N yields that
and observing that (X10 , X20 , . . . , XN
 
dTV A2 N (0, Id )⊗N , Q⊗d×N = O n−2 + N −3 d−3
 

We now apply Lemma 6.3 to the steps A1 and A2 under each of H0 and H1 , as in the proof of Theorem
10.2. Under H1 , consider Lemma 6.3 applied to the following sequence of distributions
P0 = M[m]×[n] (S × T, p, q), P1 = ISGMD (N, S, d, µ, 1/2) and P2 = GLSMD (N, S, d, U)
By Theorem 10.2 and the argument above, we can take
1 = O w−1 + kn−2 m−2 r−2t + n−2 + N −3 d−3 2 = o(1) + O n−2 + N −3 d−3
 
and
By Lemma 6.3, we therefore have that
dTV A M[m]×[n] (S × T, p, q) , GLSMD (N, S, d, U) = o(1)+O w−1 + kn−2 m−2 r−2t + n−2 + N −3 d−3
  

which proves the desired result in the case of H1 . Under H0 , similarly applying Theorem 10.2, the argument
above and Lemma 6.3 to the distributions
P0 = Bern(q)⊗m×n , P1 = N (0, Id )⊗N and P2 = Q⊗d×N
yields the total variation bound
 
dTV A Bern(q)⊗m×n , Q⊗d×N = O kn−2 m−2 r−2t + n−2 + N −3 d−3
 

which completes the proof of the theorem.

We now use this theorem to deduce our universality principle for lower bounds in GLSM. The proof of
this next theorem is similar to that of Theorems 3.1 and 13.4 and is deferred to Appendix B.2.
Theorem 3.10 (Computational
√ Lower Bounds for GLSM). Let n, k and d be polynomial in each other and
such that k = o( d). Suppose that the collections of distributions U = (D, Q, {Pν }ν∈R ) is in UC(n). Then
the k- BPC conjecture or k- BPDS conjecture for constant 0 < q < p ≤ 1 both imply a computational lower
bound for GLSM (n, k, d, U) at all sample complexities n = õ τU−4 .

136
16.2 The Universality Class UC(n) and Level of Signal τU
The result in Theorem 3.10 shows universality of the computational sample complexity of n = Ω̃(τU−4 ) for
learning sparse mixtures under the mild conditions of UC(n). In this section, we discuss this lower bound,
its implications, the universality class UC(n) and the level of signal τU .

Remarks on UC(n) and τU . The conditions for U = (D, Q, {Pν }ν∈R ) ∈ UC(n) and the definition of τU
have the following two notable properties.

• They are defined in terms of marginals: The class UC(n) and τU are defined entirely in terms of
the likelihood ratios dPν /dQ between the planted and non-planted marginals. In particular, they are
independent of the sparsity level k and other high-dimensional properties of the distribution GLSM
constructed from the Pν and Q. Theorem 3.10 thus establishes a computational lower bound for
GLSM at a sample complexity entirely based on properties of the marginals of Pν and Q.

• Their dependence on n is negligible: The parameter n only enters the definitions of UC(n) and τU
through requirements on tail probabilities. When the likelihood ratios dPν /dQ are relatively con-
centrated, the dependence of the conditions in UC(n) and τU on n is nearly negligible. If the ratios
dPν /dQ are concentrated under Pν and Q with exponentially decaying tails, then the tail probability
bound requirement of O(n−K ) only appears as a polylog(n) factor in τU . This will be the case in the
examples that appear later in this section.

D and Parameterization over [−1, 1]. D and the indices of Pν can be reparameterized without chang-
ing the underlying problem. The assumption that D is symmetric and mostly supported on [−1, 1] is for
notational convenience. As in the case of τU and the examples later in this section, the tail probability
requirement of o(n−1 ) for D only appears as a polylog(n) factor in the computational lower bound of
n = Ω̃(τU−4 ) if D is concentrated with exponential tails.
While the output vectors (X1 , X2 , . . . , XN ) of our reduction k- BPDS - TO - GLSM are independent, their
coordinates have dependence induced by the mixture D. The fact that our reduction samples the νi implies
that if these values were revealed to the algorithm, the problem would still remain hard: an algorithm for the
latter could be used together with the reduction to solve k-PC. However, even given the νi for the ith sample,
our reduction is such that whether the planted marginals in the ith sample are distributed according to Pνi
or P−νi remains unknown to the algorithm. Intuitively, our setup chooses to parameterize the distribution
D over [−1, 1] such that the sign ambiguity between Pνi or P−νi is what is producing hardness below the
sample complexity of n = Ω̃(τU−4 ).

Implications for Concentrated LLR. We now give several remarks on τU in the case that the log-
likelihood ratios (LLR) log dPν /dQ(x) are sufficiently well-concentrated if x ∼ Q or x ∼ Pν . Suppose
that U = (D, Q, {Pν }ν∈R ) ∈ UC(n), fix some arbitrarily large constant c > 0 and fix some ν ∈ [−1, 1]. If
SQ is the common support of the Pν and Q, define S to be
 
dPν dP−ν 2
dPν dP−ν
S = x ∈ SQ : c · τU ≥ (x) − (x) and c · τU ≥
(x) + (x) − 2
dQ dQ dQ dQ

Suppose that τU = Ω(n−K ) for some constant K > 0 and let c be large enough that S occurs with
probability at least 1 − O(n−K ) under each of Pν , P−ν and Q. Note that such a constant c is guaranteed by

137
Definition 3.9. Now observe that
 
1 dPν dP−ν
dTV (Pν , P−ν ) = · Ex∈Q (x) − (x)
2 dQ dQ
 
1 dPν dP−ν 1   1
(x) · 1S (x) + · Pν S C + · P−ν S C
 
≤ · Ex∈Q (x) −
2 dQ dQ 2 2
−K

≤ c · τU + O n = O (τU )

A similar calculation with the second condition defining S shows that

dTV MIX1/2 (Pν , P−ν ) , Q = O τU2


 

If the LLRs log dPν /dQ are sufficiently well-concentrated, then the random variables

dPν dP−ν and dPν (x) + dP−ν (x) − 2


dQ (x) − (x)
dQ dQ dQ

will also concentrate around their means if x ∼ Q. LLR concentration also implies that this is true if x ∼ Pν
or x ∼ P−ν . Thus, under sufficient concentration, the definition of the level of signal τU reduces to the much
more interpretable pair of upper bounds

dTV (Pν , P−ν ) = O (τU ) and dTV MIX1/2 (Pν , P−ν ) , Q = O τU2
 

These conditions directly measure the amount of statistical signal present in the planted marginals Pν . The
relevant calculations for an example application of Theorem 3.10 when the LLR concentrates is shown below
for sparse PCA. In [BBH19], various assumptions of concentration of the LLR and analogous implications
for computational lower bounds in submatrix detection are analyzed in detail. We refer the reader to Sections
3 and 9 of [BBH19] for the calculations needed to make the discussion here precise.
We remark that, assuming sufficient concentration on the LLR, the analysis of the k-sparse eigen-
value statistic from [BR13a] yields an information-theoretic upper bound for GLSM. Given GLSM samples
(X1 , X2 , . . . , Xn ), consider forming the LLR-processed samples Zi with
 
dPν
Zij = Eν∼D log (Xij )
dQ

for each i ∈ [n] and j ∈ [d]. Now consider taking the k-sparse eigenvalue of the samples Z1 , Z2 , . . . , Zn .
Under sub-Gaussianity assumptions on the Zij , the analysis in Theorem 2 of [BR13a] applies. Similarly, the
analysis in Theorem 5 of [BR13a] continues to hold, showing that the semidefinite programming algorithm
for sparse PCA yields an algorithmic upper bound for GLSM. As information-theoretic limits and algorithms
are not the focus of this paper, we omit the technical details needed to make this rigorous.
In many setups captured by GLSM such as sparse PCA, learning sparse mixtures of Gaussians and learn-
ing sparse mixtures of Rademachers, these analyses and our lower bound in Theorem 3.10 together yield a
k-to-k 2 statistical-computational gap. How our lower bound yields a k 2 dependence in the computational
barriers for these problems is discussed below.

Sparse PCA and Specific Distributions. One specific example captured by our universality principle
and that falls under the concentrated LLR setup discussed above is sparse PCA in the spiked covariance
model. The statistical-computational gaps of sparse PCA have been characterized based on the planted
clique conjecture in a line of work [BR13b, BR13a, WBS16, GMZ17, BBH18, BB19b]. We show that our
universality principle faithfully recovers the k-to-k 2 gap for sparse PCA shown in [BR13b, BR13a, WBS16,

138
GMZ17, BBH18] assuming the k- BPDS conjecture. As discussed in Section 12, also the k- BPC, k- PDS or
k- PC conjectures therefore yields nontrivial lower bounds. We remark that [BB19b] shows stronger hardness
based on weaker forms of the PC conjecture.
We show in the next lemma that sparse PCA corresponds to GLSM (n, k, d, U) for a proper choice of
U = (D, Q, {Pν }ν∈R ) ∈ UC(n) and τU so that the lower bound n = Ω̃(τU−4 ) exactly corresponds to the con-
jectured computational barrier in Sparse PCA. Recall that the hypothesis testing problem SPCA(n, k, d, θ)
has hypotheses
H0 : (X1 , X2 , . . . , Xn ) ∼i.i.d. N (0, Id )
 
H1 : (X1 , X2 , . . . , Xn ) ∼i.i.d. N 0, Id + θvv >

where v is a k-sparse d
√ unit vector in R chosen uniformly at random among all such vectors with nonzero
entries equal to 1/ k.
Lemma 16.3 (Lower Bounds for Sparse PCA). If, then SPCA(n, k, d, θ) can be expressed as GLSM(n, k, d, U)
where U = (D, Q, {Pν }ν∈R ) ∈ UC(n) is given by
r !  
θ log n 1
Pν = N 2ν , 1 for all ν ∈ R, Q = N (0, 1) and D = N 0,
k 4 log n
q 
θ(log n)2
and has valid level of signal τU = Θ k if it holds that θ(log n)2 = o(k).

Proof. Note that if X ∼ N 0, Id + θvv > then X can be written as



 
p 1
X = 2 θ log n · gv + G where g ∼ N 0, and G ∼ N (0, Id )
4 log n
and where g and G are independent. This follows from the fact that the random variable on the right-hand
side above is a jointly Gaussian vector with covariance matrix given by the sum of the covariance matrices of
the individual terms. This observation implies that SPCA(n, k, d, θ) is exactly the problem GLSM(n, k, d, U).
Now observe that the probability that x ∼ Dqsatisfies x ∈ [−1, 1] is 1 − o(n−1 ) by standard Gaussian tail
bounds. Fix some ν ∈ [−1, 1] and let t = 2ν θ log n
k . Note that

dPν dP −ν tx−t2 /2
−tx−t2 /2

dQ (x) − (x) = e − e = Θ (|tx|)
dQ

if |tx| = o(1). As long as x = O( log n), it follows√ that |tx| = O(τU ) = o(1) from the definition of τU
and fact that θ(log n)2 = o(k). Note that x = O( log n) occurs with probability at least 1 − O(n−K ) for
any constant K > 0 under each of Pν where ν ∈ [−1, 1] and Q by standard Gaussian tail bounds. Now
observe that
dPν dP−ν tx−t2 /2 −tx−t2 /2

(x) + (x) − 2 = e + e − 2 = Θ(t2 )

dQ dQ

holds if |tx| = o(1), which is true as long as x = O( log n) and thus holds with probability 1 − O(n−K )
for any fixed K > 0. Since t2 = O(τU2 ) for any ν ∈ [−1, 1], this completes the proof that U ∈ UC(n) with
level of signal τU .
Combining this lemma with Theorem 3.10 yields the k- BPDS conjecture implies a computational
√ lower
bound for Sparse PCA at the barrier n = õ(k 2 /θ2 ) as long as θ(log n)2 = o(k) and k = o( d), which
matches the planted clique lower bounds in [BR13b, BR13a, WBS16, GMZ17, BBH18]. Similar calcula-
tions to those in the above corollary can be used to identify the computational lower bound implied by
Theorem 3.10 for many other choices of U = (D, Q, {Pν }ν∈R ) ∈ UC(n). Some examples are:

139
• Balanced sparse Gaussian mixtures where Q = N (0, 1), Pν =√ N (θν,1) where
√ D is any symmetric
distribution over [−1, 1] can be shown to satisfy that τU = Θ θ log n if θ log n = o(1).

• The Bernoulli case where Q = Bern(1/2), Pν = Bern(1/2+θν) and D is any symmetric distribution
over [−1, 1] can be shown to satisfy that τU = Θ (θ) if θ ≤ 1/2.

• Sparse mixtures of exponential distributions where Q = Exp(λ), Pν = Exp(λ + θν)  and D is any
−1
symmetric distribution over [−1, 1] can be shown to satisfy that τU = Θ̃ θλ log n if it holds that
θ log n = o(λ).

• Sparse mixtures of centered Gaussians with difference variances where Q = N (0, 1), Pν = N (0, 1 +
θν) and D is any symmetric distribution over [−1, 1] can be shown to satisfy that τU = Θ (θ log n) if
θ log n = o(1).

We remark that τU can be calculated for many more choices of D, Q and Pν using the computations outlined
in the discussion above on the implications of our result for concentrated LLR.

17 Computational Lower Bounds for Recovery and Estimation


In this section, we outline several ways to deduce that our reductions to the hypothesis testing formulations
in the previous section imply computational lower bounds for natural recovery and estimation formulations
of the problems introduced in Section 3. We first introduce a notion of average-case reductions in total
variation between recovery problems and note that most of our reductions satisfy these stronger conditions
in addition to those in Section 6.2. We then discuss alternative methods of obtaining hardness of recovery
and estimation in the problems that we consider directly from computational lower bounds for detection.
In the previous section, we showed that lower bounds for our detection formulations of RSME and GLSM
directly imply lower bounds for natural estimation and recovery variants, respectively. In Section 15, we
showed that our lower bounds against blackboxes solving the detection formulation of tensor PCA with a
low false positive probability of error directly implies hardness of estimating v in `2 norm. As discussed in
Section 3.3, the problems of recovering the hidden partitions in GHPM and BHPM have very different barriers
than the testing problem we consider in this work. In this section, we will discuss recovery and estimation
hardness for the remaining problems from Section 3.

17.1 Our Reductions and Computational Lower Bounds for Recovery


Similar to the framework in Section 6.2 for reductions showing hardness of detection, there is a natural
notion of a reduction in total variation transferring computational lower bounds between recovery problems.
Let P(n, τ ) denote the recovery problem of estimating θ ∈ ΘP within some small loss `P (θ, θ̂) ≤ τ given
an observation from the distribution PD (θ). Here, n is any parameterization such that this observation has
size poly(n) and, as per usual, `P , ΘP and τ are implicitly functions of n. Define the problem P 0 (N, τ 0 )
analogously. The following is the definition of a reduction in total variation between P and P 0 .

Definition 17.1 (Reductions in Total Variation between Recovery Problems). A poly(n) time algorithm A
sending valid inputs for P(n, τ ) to valid inputs for P 0 (N, τ 0 ) is a reduction in total variation from P to P 0
if the following criteria are met for all θ ∈ ΘP :

1. There is a distribution D(θ) over ΘP 0 such that


0
(θ0 ) = on (1)

dTV A(PD (θ)), Eθ0 ∼D(θ) PD

140
2. There is a poly(n) time randomized algorithm B(X, θˆ0 ) mapping instances X of P(n, τ ) and θˆ0 ∈ ΘP 0
to θ̂ ∈ ΘP with the following property: if X ∼ PD (θ), θ0 is an arbitrary element of supp D(θ) and
θˆ0 is guaranteed to satisfy that `P 0 (θ0 , θˆ0 ) ≤ τ 0 , then B(X, θˆ0 ) outputs some θ̂ with `P (θ, θ̂) ≤ τ with
probability 1 − on (1).
While this definition has a number of technical conditions, it is conceptually simple. A randomized
algorithm A is a reduction in total variation from P to P 0 if it maps a sample from the conditional distribution
PD (θ) approximately to a sample from a mixture of PD (θ0 ), where the mixture is over a distribution D(θ)
determined by θ. Furthermore, there must be an efficient way B to recover a good estimate θ̂ of θ given
a good estimate θˆ0 of θ0 and the original instance X of P. The reason that (2) must be true for any θ0 ∈
supp D(θ) is that, to transfer recovery hardness from P to P 0 , the algorithm B will be applied to the output
θ0 of a blackbox solving P 0 applied to A(X). In this setting, θ0 and X are dependent and allowing θ0 ∈
supp D(θ) in the definition above accounts for this. Note that, as per usual, A must satisfy the properties in
the definition above oblivious to θ. The following lemma shows that Definition 17.1 fulfills its objective and
transfers hardness of recovery from P to P 0 . Its proof is simple and deferred to Appendix A.1.
Lemma 17.2. Suppose that there is reduction A from P(n, τ ) to P 0 (N, τ 0 ) satisfying the conditions in
Definition 17.1. If there is a polynomial time algorithm E 0 solving P 0 (N, τ 0 ) with probability at least p, then
there is a polynomial time algorithm E solving P(n, τ ) with probability at least p − on (1).
The recovery variants of the problems we consider all take the form of P(n, τ ). For example, ΘP is the
set of k-sparse vectors of bounded norm and `P is `2 in MSLR, and ΘP is the set of (n/k)-subsets of [n] and
`P is the size of the symmetric difference between two (n/k)-subsets in ISBM. In RSLR, ΘP can be taken
to be the set of al (u, A) where u is a k-sparse vector of bounded norm and A is a valid adversary. The
loss `P is then independent of A and given by the `2 norm on u. Throughout Parts II and III, the guarantees
we proved for our reductions among the hypothesis testing formulations from Section 6.3 generally took
the form of condition (1) in Definition 17.1. Some reductions had a post-processing step where coordinates
in the output instance are randomly permuted or subsampled, but these can simply be removed to yield a
guarantee matching the form of (1). In light of this and Lemma 17.2, it suffices to show that our reductions
also satisfy condition (2) in Definition 17.1. We outline how to construct these algorithms B for each of our
remaining problems below.

Reductions from BPC and k- BPC. All of our reductions from BPC and k- BPC to RSME, NEG - SPCA, MSLR
and RSLR map from an instance with left biclique vertex set S with |S| = km to an instance with hidden
−1/2
vector u = γ · km · 1S for some γ ∈ (0, 1). In the notation of Definition 17.1, D(S) is a point mass on u.
We now outline how such reductions imply hardness of estimation up to any `2 error τ 0 = o(γ).
To verify condition (2) of Definition 17.1, it suffices to give an efficient algorithm B recovering S and
the right biclique vertices S 0 from the original BPC or k- BPC instance G and an estimate û satisfying that
−1/2
kû − γ · km · 1S k2 ≤ τ 0 . Suppose that |S| = km and |S 0 | are both ω(log n). Let Ŝ be the set of the
−1/2
largest km entries of û and note that kγ −1 · û − km · 1S k2 = o(1), which can be verified to imply that
at least (1 − o(1))km of Ŝ must be in S. A union bound and Chernoff bound can be used to show that, in a
BPC instance with left and right biclique sets S and S 0 , there is no right vertex in [n]\S 0 with at least 3km /4
neighbors in S with probability 1 − on (1) if km  log n. Therefore S 0 is exactly the set of right vertices
with at least 5km /6 neighbors in Ŝ with probability 1 − on (1). Taking the common neighbors of S 0 now
recovers S with high probability. Thus this procedure of taking the km largest entries Ŝ of û, taking right
vertices with many neighbors in Ŝ and then taking their common neighborhoods, exactly solves the BPC
and k- BPC recovery problems. We remark that hardness for these exact recovery problems follows from
detection hardness, as bipartite Erdős-Rényi random graphs do not contain bicliques of left and right sizes
ω(log n) with probability 1 − on (1).

141
We remark that for the values of γ in our reductions, the condition τ = o(γ) implies tight computational
lower bounds for estimation in RSME, NEG - SPCA, MSLR and RSLR. In particular, for RSME, MSLR and
RSLR , we may take τ 0 to be arbitrarily close to τ in our detection lower bound as long as τ 0 = o(τ ). For
NEG - SPCA , a natural estimation analogue is to estimate some k-sparse v within `2 norm
√ τ 0 given n i.i.d.
samples from N (0, Id + vv > ). For this estimation formulation, we may take τ 0 = o( θ) where θ is as in
our detection lower bound.

Reductions from k- PC. We now outline how to construct such an algorithm B for ISBM. We only sketch
the details of this construction as a more direct and simpler way to deduce hardness of recovery for ISBM
will be discussed in the next section. We remark that a similar construction of B also verifies condition (2)
for our reduction to SEMI - CR.
For simplicity, first consider k- PDS - TO - ISBM without the initial T O -k-PARTITE -S UBMATRIX step and
the random permutations of vertex labels in Steps 2 and 4. Let S ⊆ [krt ] be the vertex set of the planted
dense subgraph in MPD2 and let F 0 and F 00 be the given partitions of the indices [krt ] of MPD2 and the
vertices [kr`] of the output graph, respectively. Lemma 14.3 shows that the output instance of ISBM has its
smaller hidden community C1 of size k` on the vertices corresponding to the negative entries of the vector
vS,F 0 ,F 00 (Kr,t ). Note that, as a function of this set S, the mixture distribution D(S) is again a point mass.
We now will outline how to approximately recover S given a close estimate Ĉ1 of C1 . Suppose that Ĉ1 is a
k`-subset of [kr`] such that |C1 ∩ Ĉ1 | ≥ (1 − o(1))k`. Construct the vector v̂ given by
(
1 1 if i 6∈ Ĉ1
v̂i = p ·
rt (r − 1) 1 − r if i ∈ Ĉ1

Since ℓ = Θ(r^{t−1}), a direct calculation shows that ‖v̂ − v_{S,F′,F″}(K_{r,t})‖₂ = o(√k). For each part F_i″,
consider the vector in R^{rℓ} formed by restricting v̂ to the indices in F_i″ and identifying these indices with [rℓ]
in increasing order. For each such vector, find the closest column of K_{r,t} to this vector in ℓ₂ norm. If the
index of this column is j, add the jth smallest element of F_i′ to Ŝ. We claim that the resulting set Ŝ contains
at least (1 − o(1))k elements of S. The singular values of K_{r,t} computed in Lemma 8.5 can be used to show
that any two columns of K_{r,t} are separated by an ℓ₂ distance of Ω(1). Any part F_i′ for which the correct
j ∈ S ∩ F_i′ was not added to Ŝ must have satisfied that v̂ restricted to the part F_i″ was an ℓ₂ distance of Ω(1)
from the corresponding restriction of v_{S,F′,F″}(K_{r,t}). Since ‖v̂ − v_{S,F′,F″}(K_{r,t})‖₂ = o(√k), the number of
parts on which an incorrect element is added to Ŝ is o(k), verifying the claim.
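A minimal sketch of this per-part matching step is below (our own illustration; the variable names are hypothetical, K stands for the matrix K_{r,t} with columns in R^{rℓ}, and parts1, parts2 are the partitions F′ and F″ represented as lists of sorted index lists).

    import numpy as np

    def match_parts(v_hat, K, parts1, parts2):
        # v_hat: estimate of v_{S,F',F''}(K_{r,t}); K: (r*ell) x (r*t) matrix standing in for K_{r,t}
        S_hat = set()
        for F1, F2 in zip(parts1, parts2):
            # Restrict v_hat to the part F''_i, identifying its indices with [r*ell] in increasing order.
            block = v_hat[np.array(F2)]
            # Find the column of K closest to this restriction in l2 norm.
            j = int(np.argmin(np.linalg.norm(K - block[:, None], axis=0)))
            # Add the (j+1)-th smallest element of F'_i to S_hat (j is 0-indexed here).
            S_hat.add(F1[j])
        return S_hat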


Now consider k-PDS-TO-ISBM with its first step and the random permutations. Since the random index
permutation in TO-k-PARTITE-SUBMATRIX and the subsequent random permutations in Steps 2 and 4
are all generated by the reduction, they can also be remembered and used in the algorithm B recovering
the clique of the input k- PC instance. When combined with the subroutine recovering Ŝ from Ĉ1 , these
permutations are sufficient to identify a set of k vertices overlapping with the clique in at least (1 − o(1))k
vertices. Now using a similar procedure to the one mentioned above for BPC, together with the input k- PC
instance G, this is sufficient to exactly recover the hidden clique vertices.

17.2 Relationship Between Detection and Recovery


As shown in the previous section, computational lower bounds for recovery can generally be deduced
from our reductions because they are also reductions in total variation between recovery problems. We
will now outline how our computational lower bounds for detection all either directly or almost directly imply
hardness of recovery. As in Section 10 of [BBH18], our approach is to produce two independent instances
X and X′ from P_D(θ) without knowing θ, to use X to recover an estimate θ̂ of θ and then to verify that
θ̂ is a good estimate of θ using X′. If θ̂ is confirmed to closely approximate θ using X′, then output H1,
and otherwise output H0 . This recipe shows detection is easier than recovery as long as there are efficient
ways to produce the pair (X, X 0 ) and to verify θ̂ is a good estimate given a fresh sample X 0 . In general, the
purpose of cloning into the pair (X, X 0 ) is to sidestep the fact that X and θ̂ are dependent random variables,
which complicates analyzing the verification step. In contrast, θ̂ and X 0 are conditionally independent given
θ. We now show that this recipe applies to each of our problems.

Sample Splitting. In problems with samples, a natural way to produce X and X 0 is to simply split the
set of samples into two groups. This yields a means to directly transfer computational lower bounds from
detection to recovery for RSME, NEG-SPCA, MSLR and RSLR. Since we already discussed one way in which
our reductions imply computational lower bounds for the recovery variants of these problems in the previous
section, we only sketch the main ideas here.
We first show that an efficient algorithm for recovery in MSLR yields an efficient algorithm for detec-
tion. Consider the detection problem MSLR(2n, k, d, τ), and assume there is a blackbox E solving the
recovery problem MSLR(n, k, d, τ′) with probability 1 − o_n(1) for some τ′ = o(τ). If the samples from
MSLR(2n, k, d, τ) are (X1, y1), (X2, y2), . . . , (X_{2n}, y_{2n}), apply E to (X1, y1), . . . , (Xn, yn) to produce an
estimate û. Under H1, there is some true u = τ · k^{−1/2} · 1_S for some k-set S and it holds that ‖û − u‖₂ = o(τ).
As in the previous section, taking the largest k coordinates of û yields a set Ŝ containing at least (1 − o(1))k
elements of S. The idea is that, since we now almost know the true set S, detection using the second group of n
samples essentially reduces to MSLR without sparsity, which is easy down to the information-theoretic limit.
More precisely, consider using the second half of the samples to form the statistic
\[
Z = \frac{1}{\tau^2(1 + \tau^2)} \sum_{i=n+1}^{2n} \left( y_i^2 - 1 - \tau^2 \right) \cdot \left\langle (X_i)_{\hat{S}}, \hat{u}_{\hat{S}} \right\rangle^2
\]

where v_Ŝ denotes the vector equal to v on the indices in Ŝ and zero elsewhere. Note that conditioned on
S, the second group of n samples is independent of Ŝ. Under H0, it can be verified that E[Z] = 0 and
Var[Z] = O(n). Under H1, it can be verified that ‖û‖₂ and ‖û_Ŝ‖₂ are both (1 + o(1))τ and furthermore
that ⟨u, û_Ŝ⟩ ≥ (1 − o(1))τ². Now note that since y_i = R_i · ⟨X_i, u⟩ + g_i where g_i ∼ N(0, 1) and R_i ∼ Rad,
we have that
\[
\left( y_i^2 - 1 - \tau^2 \right) \cdot \left\langle (X_i)_{\hat{S}}, \hat{u}_{\hat{S}} \right\rangle^2 = \langle X_i, u \rangle^2 \cdot \langle X_i, \hat{u}_{\hat{S}} \rangle^2 - \tau^2 \cdot \langle X_i, \hat{u}_{\hat{S}} \rangle^2 + 2 R_i g_i \cdot \langle X_i, u \rangle \cdot \langle X_i, \hat{u}_{\hat{S}} \rangle^2 + (g_i^2 - 1) \cdot \langle X_i, \hat{u}_{\hat{S}} \rangle^2
\]

The last two terms are mean zero and the second term has expectation −(1 + o(1))τ⁴ since ‖û_Ŝ‖₂ =
(1 + o(1))τ. Directly expanding the first term in terms of the components of X_i yields that its expectation
is given by 2⟨u, û_Ŝ⟩² + ‖u‖₂² · ‖û_Ŝ‖₂² ≥ 3(1 − o(1))τ⁴. Combining these computations yields that E[Z] ≥
2n(1 − o(1))τ², and it can again be verified that Var[Z] = O(n). Chebyshev's inequality now yields that
thresholding Z at nτ² distinguishes H0 and H1 as long as τ²√n ≫ 1. Since the information-theoretic limit
of the detection formulation of MSLR is when n = Θ(k log d/τ⁴) [FLWY18], whenever this problem is
possible it holds that τ²√n ≫ 1. Therefore, whenever detection is possible, the reduction outlined above
shows how to produce a test solving detection in MSLR using an estimator with ℓ₂ error τ′ = o(τ).
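The sample-splitting test above can be written out directly. The following Python sketch (ours) assumes a recovery blackbox has already produced the estimate û from the first n samples and applies the threshold test to the second half; all names are hypothetical.

    import numpy as np

    def mslr_split_test(X2, y2, u_hat, k, tau):
        # X2, y2: the second group of n samples; u_hat: estimate from the first group.
        n = len(y2)
        # S_hat: support of the k largest coordinates of u_hat; u_hat restricted to S_hat.
        S_hat = np.argsort(u_hat)[-k:]
        u_restricted = np.zeros_like(u_hat)
        u_restricted[S_hat] = u_hat[S_hat]
        # Z = (1 / (tau^2 (1 + tau^2))) * sum_i (y_i^2 - 1 - tau^2) * <(X_i)_{S_hat}, u_hat_{S_hat}>^2
        inner = X2 @ u_restricted
        Z = np.sum((y2 ** 2 - 1 - tau ** 2) * inner ** 2) / (tau ** 2 * (1 + tau ** 2))
        # Thresholding Z at n * tau^2 distinguishes H0 from H1 when tau^2 * sqrt(n) >> 1.
        return Z >= n * tau ** 2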
Similar reductions transfer hardness from detection to recovery for NEG-SPCA, RSME and RSLR. For
NEG-SPCA and RSME, the same argument as above can be shown to work with the test statistic given by
Z = Σ_{i=1}^{2n} ⟨X_i, û_Ŝ⟩², and the same Z used above for MSLR suffices in the case of RSLR. We remark that
to show these statistics Z solve the detection variants of RSME and RSLR, it is important to use detection
formulations incorporating the exact form of our adversarial constructions, which are ISGM in the case of
RSME and the adversary described in Section 10 in the case of RSLR. An arbitrary adversary could corrupt
instances of RSME and RSLR to cause these statistics Z to not distinguish between H0 and H1 . Because
our detection lower bounds apply to these fixed adversaries rather than requiring an arbitrary adversary, this
argument yields the desired hardness of estimation for RSME and RSLR.

Post-Reduction Cloning. In problems without samples, producing the pair (X, X 0 ) requires an additional
reduction step. We now outline how to produce such a pair and verification step for ISBM. The high-level
idea is to stop our reduction to ISBM before the final thresholding step, apply Gaussian cloning as in Section
10 of [BBH18], then to continue the reduction with both copies, eventually using one to verify the output
of a recovery blackbox applied to the other. A similar argument can be used to show computational lower
bounds for recovery in SEMI - CR.
Consider the reduction k-PDS-TO-ISBM without the final thresholding step, outputting the matrix M_R ∈
R^{krℓ×krℓ} at the end of Step 3. Now consider adding the following three steps to this reduction, given access
to a recovery blackbox E. More precisely, given an instance of ISBM(n, k, P11, P12, P22) with
\[
P_{11} = P_0 + \gamma, \quad P_{12} = P_0 - \frac{\gamma}{k-1} \quad \text{and} \quad P_{22} = P_0 + \frac{\gamma}{(k-1)^2}
\]
as in Section 14.1, suppose E is guaranteed to output an (n/k)-subset of vertices Ĉ1 ⊆ [n] with |C1 ∩ Ĉ1| ≥
(1 + ε)n/k² with probability 1 − o_n(1) for some ε = Ω(1). Here, C1 is the true hidden smaller community
of the input ISBM instance. Observe that when ε = Θ(1), the blackbox E has the weak guarantee of
recovering marginally more than a trivial 1/k fraction of C1. This exactly matches the notion of weak
recovery discussed in Section 3.2.
1. Sample W ∼ N(0, 1)^{⊗n×n} and form
\[
M_R^1 = \frac{1}{\sqrt{2}}(M_R + W) \quad \text{and} \quad M_R^2 = \frac{1}{\sqrt{2}}(M_R - W)
\]

2. Using each of M_R^1 and M_R^2, complete the reduction k-PDS-TO-ISBM omitting the random permutation
in Step 4, and complete the additional steps from Corollary 14.5 replacing µ with µ/√2. Let the two
output graphs be G1 and G2.

3. Let Ĉ1 be the output of E applied to G1 . Output H0 if the subgraph of G2 restricted to Ĉ1 has at least
M edges, and output H1 otherwise.
We now outline how this solves the detection variant of ISBM. Let C1 be the true hidden smaller community
of the instance that k-PDS-TO-ISBM would produce if completed using M_R. We claim that G1 and G2 are
o(1) total variation from independent copies of ISBM(n, C1, P11, P12, P22) where P11, P12 and P22 are as
above and γ is as in Corollary 14.5, but defined using µ/√2 instead of µ. To see this, note that M_R is o(1)
total variation from the distribution
\[
M_R' = \frac{\mu(r-1)}{r} \cdot v(C_1)v(C_1)^\top + Y \quad \text{where} \quad v(C_1)_i = \frac{1}{\sqrt{r^t(r-1)}} \cdot \begin{cases} 1 & \text{if } i \notin C_1 \\ 1 - r & \text{if } i \in C_1 \end{cases}
\]
by Lemma 14.3, where Y ∼ N(0, 1)^{⊗n×n} and t is the internal parameter used in k-PDS-TO-ISBM. Now it
follows that M_R^1 and M_R^2 are respectively o(1) total variation from
\[
M_R^{1\prime} = \frac{\mu(r-1)}{r\sqrt{2}} \cdot v(C_1)v(C_1)^\top + \frac{1}{\sqrt{2}}(Y + W) \quad \text{and} \quad M_R^{2\prime} = \frac{\mu(r-1)}{r\sqrt{2}} \cdot v(C_1)v(C_1)^\top + \frac{1}{\sqrt{2}}(Y - W)
\]

The entries of (Y + W)/√2 and (Y − W)/√2 are all jointly Gaussian and have variance 1. Furthermore,
they can all be verified to be uncorrelated, implying that these two matrices are independent copies of
N(0, 1)^{⊗n×n} and thus M_R^{1′} and M_R^{2′} are independent conditioned on C1. Note that µ has essentially
been scaled down by a factor of √2 in both of these instances as well. Thus Step 2 above ensures that G1
and G2 are o(1) total variation from independent copies of ISBM(n, C1, P11, P12, P22).
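The cloning trick in Step 1 can also be checked numerically. The following short sketch (ours, with arbitrary dimensions) illustrates that (Y + W)/√2 and (Y − W)/√2 have unit-variance entries and are empirically uncorrelated, consistent with their being independent copies of N(0, 1)^{⊗n×n}.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 300
    Y = rng.standard_normal((n, n))   # noise already present in M_R'
    W = rng.standard_normal((n, n))   # fresh noise drawn by the reduction

    A = (Y + W) / np.sqrt(2)          # noise component of the first clone
    B = (Y - W) / np.sqrt(2)          # noise component of the second clone

    # Entrywise variances are ~1 and the empirical correlation between A and B is ~0.
    print(A.var(), B.var(), np.corrcoef(A.ravel(), B.ravel())[0, 1])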
Now consider Step 3 above applied to two exact independent copies of ISBM(n, C1, P11, P12, P22). The
guarantee for E ensures that |C1 ∩ Ĉ1| ≥ (1 + ε)n/k² with probability 1 − o_n(1). The variance of the number
of edges in the subgraph of G2 restricted to Ĉ1 is O(n²/k²) under both H0 and H1, and the expected number
of edges in this subgraph is P_0\binom{n/k}{2} under H0. Under H1, the expected number of edges is
\[
\mathbb{E}\left[ |E(G[\hat{C}_1])| \right] = \binom{|C_1 \cap \hat{C}_1|}{2} P_{11} + |C_1 \cap \hat{C}_1| \cdot \left( \frac{n}{k} - |C_1 \cap \hat{C}_1| \right) P_{12} + \binom{\frac{n}{k} - |C_1 \cap \hat{C}_1|}{2} P_{22}
\]
\[
= P_0 \binom{n/k}{2} + \frac{\gamma}{2(k-1)^2} \cdot \left( k|C_1 \cap \hat{C}_1| - \frac{n}{k} \right)^2 - \frac{\gamma}{2(k-1)} \cdot \left( (k-2) \cdot |C_1 \cap \hat{C}_1| + \frac{n}{k} \right)
\]
\[
= P_0 \binom{n/k}{2} + \Omega\left( \frac{\gamma \epsilon^2 n^2}{k^4} \right)
\]
where the last bound holds since ε = Ω(1) and k² ≪ n.


By Chebyshev's inequality, Step 3 solves the hypothesis testing problem exactly when this difference
Ω(γε²n²/k⁴) grows faster than the O(n/k) standard deviations in the number of edges in the subgraph under
H0 and H1. This implies that Step 3 succeeds if it holds that γε² ≫ k³/n. The Kesten-Stigum threshold
corresponds to γ² = Θ̃(k²/n), and substituting γ = Θ̃(k/√n) into γε² ≫ k³/n yields ε² ≫ k²/√n; therefore
as long as ε⁴n = ω̃(k⁴), this argument solves the detection problem just below the Kesten-Stigum threshold.
When ε = Θ(1), this argument shows a computational lower bound up to the Kesten-Stigum threshold
for weak recovery in ISBM. Since k² = o(n) is always true in our formulation of ISBM, setting ε = Θ(√k)
yields that for all k it is hard to recover a Θ(1/√k) fraction
of the hidden community C1 . This guarantee is much stronger than the analysis in the previous section,
which only showed hardness for a blackbox recovering a 1 − o(1) fraction of the hidden community. We
remark that the same trick used in Step 1 above to produce two independent copies of a matrix with Gaussian
noise was used to show estimation lower bounds for tensor PCA in Section 15.

Pre-Reduction Cloning. We remark that there is a general alternative method to obtain the pairs (X, X 0 )
in our reductions that we sketch here. Consider applying Bernoulli cloning either directly to the input PC or
PDS instance or to the output of T O -k-PARTITE -S UBMATRIX , in the case of reductions from k- PC , and then
running the remaining parts of our reductions on each of the two resulting copies. Ignoring post-processing
steps where we permute vertex labels or subsample the output instance, this general approach can be used to
yield two copies of the outputs of our reductions that have the same hidden structure and are conditionally
independent given this hidden structure. The same verification steps outlined above can then be applied to
obtain our computational lower bounds for recovery.

Acknowledgements
We are greatly indebted to Jerry Li for introducing the conjectured statistical-computational gap for robust
sparse mean estimation and for discussions that helped lead to this work. We thank Ilias Diakonikolas
for pointing out the statistical query model construction in [DKS17]. We thank the anonymous reviewers
for helpful feedback that greatly improved the exposition. We also thank Frederic Koehler, Sam Hopkins,
Philippe Rigollet, Enric Boix-Adserà, Dheeraj Nagaraj, Rares-Darius Buhai, Alex Wein, Ilias Zadik, Dylan
Foster and Austin Stromme for inspiring discussions on related topics. This work was supported in part by
MIT-IBM Watson AI Lab and NSF CAREER award CCF-1940205.

References
[AAK+ 07] Noga Alon, Alexandr Andoni, Tali Kaufman, Kevin Matulef, Ronitt Rubinfeld, and Ning Xie.
Testing k-wise and almost k-wise independence. In Proceedings of the thirty-ninth annual
ACM symposium on Theory of computing, pages 496–505. ACM, 2007.

[Abb17] Emmanuel Abbe. Community detection and stochastic block models: recent developments.
The Journal of Machine Learning Research, 18(1):6446–6531, 2017.

[ABH15] Emmanuel Abbe, Afonso S Bandeira, and Georgina Hall. Exact recovery in the stochastic
block model. IEEE Transactions on Information Theory, 62(1):471–487, 2015.

[ABL14] Pranjal Awasthi, Maria Florina Balcan, and Philip M Long. The power of localization for
efficiently learning linear separators with noise. In Proceedings of the forty-sixth annual
ACM symposium on Theory of computing, pages 449–458. ACM, 2014.

[ABX08] Benny Applebaum, Boaz Barak, and David Xiao. On basing lower-bounds for learning on
worst-case assumptions. In 2008 49th Annual IEEE Symposium on Foundations of Computer
Science, pages 211–220. IEEE, 2008.

[ACO08] Dimitris Achlioptas and Amin Coja-Oghlan. Algorithmic barriers from phase transitions. In
2008 49th Annual IEEE Symposium on Foundations of Computer Science, pages 793–802.
IEEE, 2008.

[ACV14] Ery Arias-Castro and Nicolas Verzelen. Community detection in dense random networks.
The Annals of Statistics, 42(3):940–969, 2014.

[AGHK14] Animashree Anandkumar, Rong Ge, Daniel Hsu, and Sham M Kakade. A tensor approach to
learning mixed membership community models. The Journal of Machine Learning Research,
15(1):2239–2312, 2014.

[AKS98] Noga Alon, Michael Krivelevich, and Benny Sudakov. Finding a large hidden clique in a
random graph. Random Structures and Algorithms, 13(3-4):457–466, 1998.

[Ame14] Brendan PW Ames. Guaranteed clustering and biclustering via semidefinite programming.
Mathematical Programming, 147(1-2):429–465, 2014.

[AS15] Emmanuel Abbe and Colin Sandon. Detection in the stochastic block model with multiple
clusters: proof of the achievability conjectures, acyclic bp, and the information-computation
gap. arXiv preprint arXiv:1512.09080, 2015.

[AS16] Emmanuel Abbe and Colin Sandon. Achieving the ks threshold in the general stochastic
block model with linearized acyclic belief propagation. In Advances in Neural Information
Processing Systems, pages 1334–1342, 2016.

[AS18] Emmanuel Abbe and Colin Sandon. Proof of the achievability conjectures for the general
stochastic block model. Communications on Pure and Applied Mathematics, 71(7):1334–
1406, 2018.

[ASW13] Martin Azizyan, Aarti Singh, and Larry Wasserman. Minimax theory for high-dimensional
gaussian mixtures with sparse mean separation. In Advances in Neural Information Process-
ing Systems, pages 2139–2147, 2013.

[ASW15] Martin Azizyan, Aarti Singh, and Larry Wasserman. Efficient sparse clustering of high-
dimensional non-spherical gaussian mixtures. In Artificial Intelligence and Statistics, pages
37–45, 2015.

[AV11] Brendan PW Ames and Stephen A Vavasis. Nuclear norm minimization for the planted clique
and biclique problems. Mathematical programming, 129(1):69–89, 2011.

[Bar17] Boaz Barak. The Complexity of Public-Key Cryptography, pages 45–77. Springer Interna-
tional Publishing, Cham, 2017.

[BB08] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in
neural information processing systems, pages 161–168, 2008.

[BB19a] Matthew Brennan and Guy Bresler. Average-case lower bounds for learning sparse mixtures,
robust estimation and semirandom adversaries. arXiv preprint arXiv:1908.06130, 2019.

[BB19b] Matthew Brennan and Guy Bresler. Optimal average-case reductions to sparse pca: From
weak assumptions to strong hardness. In Conference on Learning Theory, pages 469–470,
2019.

[BBH18] Matthew Brennan, Guy Bresler, and Wasim Huleihel. Reducibility and computational lower
bounds for problems with planted sparse structure. In COLT, pages 48–166, 2018.

[BBH19] Matthew Brennan, Guy Bresler, and Wasim Huleihel. Universality of computational lower
bounds for submatrix detection. In Conference on Learning Theory, pages 417–468, 2019.

[BBN19] Matthew Brennan, Guy Bresler, and Dheeraj Nagaraj. Phase transitions for detecting latent
geometry in random graphs. arXiv preprint arXiv:1910.14167, 2019.

[BCLS87] Thang Nguyen Bui, Soma Chaudhuri, Frank Thomson Leighton, and Michael Sipser. Graph
bisection algorithms with good average case behavior. Combinatorica, 7(2):171–191, 1987.

[BDER16] Sébastien Bubeck, Jian Ding, Ronen Eldan, and Miklós Z Rácz. Testing for high-dimensional
geometry in random graphs. Random Structures and Algorithms, 2016.

[BDLS17] Sivaraman Balakrishnan, Simon S Du, Jerry Li, and Aarti Singh. Computationally efficient
robust sparse estimation in high dimensions. pages 169–212, 2017.

[BG16] Sébastien Bubeck and Shirshendu Ganguly. Entropic clt and phase transition in high-
dimensional wishart matrices. International Mathematics Research Notices, 2018(2):588–
606, 2016.

[BGJ18] Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Algorithmic thresholds for tensor
pca. arXiv preprint arXiv:1808.00921, 2018.

[BHK+ 16] Boaz Barak, Samuel B Hopkins, Jonathan Kelner, Pravesh Kothari, Ankur Moitra, and Aaron
Potechin. A nearly tight sum-of-squares lower bound for the planted clique problem. In
Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages
428–437. IEEE, 2016.

[BI13] Cristina Butucea and Yuri I Ingster. Detection of a sparse submatrix of a high-dimensional
noisy matrix. Bernoulli, 19(5B):2652–2688, 2013.

[Bis06] Christopher M Bishop. Pattern recognition and machine learning. springer, 2006.

[BJR07] Béla Bollobás, Svante Janson, and Oliver Riordan. The phase transition in inhomogeneous
random graphs. Random Structures & Algorithms, 31(1):3–122, 2007.

[BKM+ 19] Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. Optimal
errors and phase transitions in high-dimensional generalized linear models. Proceedings of
the National Academy of Sciences, 116(12):5451–5460, 2019.

[BKW19] Afonso S Bandeira, Dmitriy Kunisky, and Alexander S Wein. Computational hardness of
certifying bounds on constrained pca problems. arXiv preprint arXiv:1902.07324, 2019.

[BLM15] Charles Bordenave, Marc Lelarge, and Laurent Massoulié. Non-backtracking spectrum of
random graphs: community detection and non-regular ramanujan graphs. In 2015 IEEE 56th
Annual Symposium on Foundations of Computer Science, pages 1347–1357. IEEE, 2015.

[BMMN17] Gerard Ben Arous, Song Mei, Andrea Montanari, and Mihai Nica. The landscape of the
spiked tensor model. arXiv preprint arXiv:1711.05424, 2017.

[BMNN16] Jess Banks, Cristopher Moore, Joe Neeman, and Praneeth Netrapalli. Information-theoretic
thresholds for community detection in sparse networks. In Conference on Learning Theory,
pages 383–416, 2016.

[Bop87] Ravi B Boppana. Eigenvalues and graph bisection: An average-case analysis. In 28th Annual
Symposium on Foundations of Computer Science (sfcs 1987), pages 280–285. IEEE, 1987.

[BPW18] Afonso S Bandeira, Amelia Perry, and Alexander S Wein. Notes on computational-to-
statistical gaps: predictions using statistical physics. arXiv preprint arXiv:1803.11132, 2018.

[BR13a] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse princi-
pal component detection. In COLT, pages 1046–1066, 2013.

[BR13b] Quentin Berthet and Philippe Rigollet. Optimal detection of sparse principal components in
high dimension. The Annals of Statistics, 41(4):1780–1815, 2013.

[BRT09] Peter J Bickel, Ya’acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of lasso
and dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

[BS95] Avrim Blum and Joel Spencer. Coloring random and semi-random k-colorable graphs. Jour-
nal of Algorithms, 19(2):204–234, 1995.

[BS04] Béla Bollobás and Alex D Scott. Max cut for random graphs with a planted partition. Com-
binatorics, Probability and Computing, 13(4-5):451–474, 2004.

[BT06a] Andrej Bogdanov and Luca Trevisan. On worst-case to average-case reductions for np prob-
lems. SIAM Journal on Computing, 36(4):1119–1159, 2006.

[BT+ 06b] Andrej Bogdanov, Luca Trevisan, et al. Average-case complexity. Foundations and Trends in
Theoretical Computer Science, 2(1):1–106, 2006.

[BVH+ 16] Afonso S Bandeira, Ramon Van Handel, et al. Sharp nonasymptotic bounds on the norm
of random matrices with independent entries. The Annals of Probability, 44(4):2479–2506,
2016.

[BWY+ 17] Sivaraman Balakrishnan, Martin J Wainwright, Bin Yu, et al. Statistical guarantees for the em
algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120,
2017.

[CC18] Utkan Onur Candogan and Venkat Chandrasekaran. Finding planted subgraphs with few
eigenvalues using the schur–horn relaxation. SIAM Journal on Optimization, 28(1):735–759,
2018.

[CCM13] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust sparse regression under
adversarial corruption. In International Conference on Machine Learning, pages 774–782,
2013.

[CCT12] Kamalika Chaudhuri, Fan Chung, and Alexander Tsiatas. Spectral clustering of graphs with
general degrees in the extended planted partition model. In Conference on Learning Theory,
pages 35–1, 2012.

[CGP+ 19] Wei-Kuo Chen, David Gamarnik, Dmitry Panchenko, Mustazee Rahman, et al. Suboptimality
of local algorithms for a class of max-cut problems. The Annals of Probability, 47(3):1587–
1618, 2019.

[CH10] Yao-ban Chan and Peter Hall. Using evidence of mixed populations to select variables
for clustering very high-dimensional data. Journal of the American Statistical Association,
105(490):798–809, 2010.

[Che15] Yudong Chen. Incoherence-optimal matrix completion. IEEE Transactions on Information
Theory, 61(5):2909–2923, 2015.

[Che19] Wei-Kuo Chen. Phase transition in the spiked random tensor with rademacher prior. The
Annals of Statistics, 47(5):2734–2756, 2019.

[CHL18] Wei-Kuo Chen, Madeline Handschy, and Gilad Lerman. Phase transition in random tensors
with multiple spikes. arXiv preprint arXiv:1809.06790, 2018.

[CJ13] Venkat Chandrasekaran and Michael I Jordan. Computational and statistical tradeoffs via
convex relaxation. Proceedings of the National Academy of Sciences, 110(13):E1181–E1190,
2013.

[CK01] Anne Condon and Richard M Karp. Algorithms for graph partitioning on the planted partition
model. Random Structures & Algorithms, 18(2):116–140, 2001.

[CK09] Hyonho Chun and Sündüz Keles. Expression quantitative trait loci mapping with multivariate
sparse partial least squares regression. Genetics, 2009.

[CL13] Arun Tejasvi Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear
regressions. In International Conference on Machine Learning, pages 1040–1048, 2013.
[CLM16] T Tony Cai, Xiaodong Li, and Zongming Ma. Optimal rates of convergence for noisy sparse
phase retrieval via thresholded wirtinger flow. The Annals of Statistics, 44(5):2221–2251,
2016.
[CLM18] Francesco Caltagirone, Marc Lelarge, and Léo Miolane. Recovering asymmetric communi-
ties in the stochastic block model. IEEE Transactions on Network Science and Engineering,
5(3):237–246, 2018.
[CLR15] Tony Cai, Tengyuan Liang, and Alexander Rakhlin. Computational and statistical boundaries
for submatrix localization in a large noisy matrix. arXiv preprint arXiv:1502.01988, 2015.
[CLS15] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via wirtinger
flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007,
2015.
[CMW15] Tony Cai, Zongming Ma, and Yihong Wu. Optimal estimation and rank detection for sparse
spiked covariance matrices. Probability theory and related fields, 161(3-4):781–815, 2015.
[CMW20] Michael Celentano, Andrea Montanari, and Yuchen Wu. The estimation error of general first
order methods. arXiv preprint arXiv:2002.12903, 2020.
[CO10] Amin Coja-Oghlan. Graph partitioning via adaptive spectral techniques. Combinatorics,
Probability and Computing, 19(2):227–284, 2010.
[CSV17] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In
Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages
47–60. ACM, 2017.
[CSX12] Yudong Chen, Sujay Sanghavi, and Huan Xu. Clustering sparse graphs. In Advances in
neural information processing systems, pages 2204–2212, 2012.
[CSX14] Yudong Chen, Sujay Sanghavi, and Huan Xu. Improved graph clustering. IEEE Transactions
on Information Theory, 60(10):6440–6455, 2014.
[CSX17] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial
settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis
of Computing Systems, 1(2):1–25, 2017.
[CW18] T. Tony Cai and Yihong Wu. Statistical and computational limits for sparse matrix detection.
arXiv preprint arXiv:1801.00518, 2018.
[CW+ 19] Didier Chételat, Martin T Wells, et al. The middle-scale asymptotics of wishart matrices. The
Annals of Statistics, 47(5):2639–2670, 2019.
[CX16] Yudong Chen and Jiaming Xu. Statistical-computational tradeoffs in planted problems and
submatrix localization with a growing number of clusters and submatrices. Journal of Ma-
chine Learning Research, 17(27):1–57, 2016.
[CYC14] Yudong Chen, Xinyang Yi, and Constantine Caramanis. A convex formulation for mixed
regression with two components: Minimax optimal rates. In Conference on Learning Theory,
pages 560–604, 2014.

[CYC17] Yudong Chen, Xinyang Yi, and Constantine Caramanis. Convex and nonconvex formulations
for mixed regression with two components: Minimax optimal rates. IEEE Transactions on
Information Theory, 64(3):1738–1766, 2017.

[DAM15] Yash Deshpande, Emmanuel Abbe, and Andrea Montanari. Asymptotic mutual information
for the two-groups stochastic block model. arXiv preprint arXiv:1507.08685, 2015.

[DF80] Persi Diaconis and David Freedman. Finite exchangeable sequences. The Annals of Proba-
bility, pages 745–764, 1980.

[DF89] Martin E. Dyer and Alan M. Frieze. The solution of some random np-hard problems in
polynomial expected time. Journal of Algorithms, 10(4):451–489, 1989.

[DF16] Roee David and Uriel Feige. On the effect of randomness on planted 3-coloring models.
In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages
77–90. ACM, 2016.

[DGGP14] Yael Dekel, Ori Gurel-Gurevich, and Yuval Peres. Finding hidden cliques in linear time with
high probability. Combinatorics, Probability and Computing, 23(1):29–49, 2014.

[DGK+ 10] Yevgeniy Dodis, Shafi Goldwasser, Yael Tauman Kalai, Chris Peikert, and Vinod Vaikun-
tanathan. Public-key encryption schemes with auxiliary inputs. In Theory of Cryptography
Conference, pages 361–381. Springer, 2010.

[DGR00] Scott E Decatur, Oded Goldreich, and Dana Ron. Computational sample complexity. SIAM
Journal on Computing, 29(3):854–879, 2000.

[DHL19] Yihe Dong, Samuel Hopkins, and Jerry Li. Quantum entropy scoring for fast robust mean
estimation and improved outlier detection. In Advances in Neural Information Processing
Systems, pages 6065–6075, 2019.

[DKK+ 16] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair
Stewart. Robust estimators in high dimensions without the computational intractability. In
2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages
655–664. IEEE, 2016.

[DKK+ 18] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair
Stewart. Robustly learning a gaussian: Getting optimal error, efficiently. In Proceedings of
the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2683–2702.
Society for Industrial and Applied Mathematics, 2018.

[DKK+ 19] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Jacob Steinhardt, and Alistair
Stewart. Sever: A robust meta-algorithm for stochastic optimization. In International Con-
ference on Machine Learning, pages 1596–1606, 2019.

[DKMZ11] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Asymptotic
analysis of the stochastic block model for modular networks and its algorithmic applications.
Physical Review E, 84(6):066106, 2011.

[DKS17] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Statistical query lower bounds for
robust estimation of high-dimensional gaussians and gaussian mixtures. In 2017 IEEE 58th
Annual Symposium on Foundations of Computer Science (FOCS), pages 73–84. IEEE, 2017.

[DKS19] Ilias Diakonikolas, Weihao Kong, and Alistair Stewart. Efficient algorithms and lower bounds
for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium
on Discrete Algorithms, pages 2745–2754. SIAM, 2019.

[DLSS14] Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. From average case complexity to im-
proper learning complexity. pages 441–448, 2014.
[DM15a] Yash Deshpande and Andrea Montanari. Finding hidden cliques of size √(N/e) in nearly linear
time. Foundations of Computational Mathematics, 15(4):1069–1128, 2015.

[DM15b] Yash Deshpande and Andrea Montanari. Improved sum-of-squares lower bounds for hidden
clique and hidden submatrix problems. pages 523–562, 2015.

[DSS16] Amit Daniely and Shai Shalev-Shwartz. Complexity theoretic limitations on learning DNF’s.
pages 815–830, 2016.

[DV89] Richard D De Veaux. Mixtures of linear regressions. Computational Statistics & Data Anal-
ysis, 8(3):227–245, 1989.

[EM16] Ronen Eldan and Dan Mikulincer. Information and dimensionality of anisotropic random
geometric graphs. arXiv preprint arXiv:1609.02490, 2016.

[Fei02] Uriel Feige. Relations between average case complexity and approximation complexity. In
Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, pages 534–
543. ACM, 2002.

[FF93] Joan Feigenbaum and Lance Fortnow. Random-self-reducibility of complete sets. SIAM
Journal on Computing, 22(5):994–1005, 1993.

[FGR+ 13] Vitaly Feldman, Elena Grigorescu, Lev Reyzin, Santosh Vempala, and Ying Xiao. Statistical
algorithms and a lower bound for detecting planted cliques. In Proceedings of the forty-fifth
annual ACM symposium on Theory of computing, pages 655–664. ACM, 2013.

[FK81] Zoltán Füredi and János Komlós. The eigenvalues of random symmetric matrices. Combina-
torica, 1(3):233–241, 1981.

[FK00] Uriel Feige and Robert Krauthgamer. Finding and certifying a large hidden clique in a semi-
random graph. Random Structures and Algorithms, 16(2):195–208, 2000.

[FK01] Uriel Feige and Joe Kilian. Heuristics for semirandom graph problems. Journal of Computer
and System Sciences, 63(4):639–671, 2001.

[FLWY18] Jianqing Fan, Han Liu, Zhaoran Wang, and Zhuoran Yang. Curse of heterogeneity: Computa-
tional barriers in sparse mixture models and phase retrieval. arXiv preprint arXiv:1808.06996,
2018.

[FO05] Uriel Feige and Eran Ofek. Spectral techniques applied to sparse random graphs. Random
Structures & Algorithms, 27(2):251–275, 2005.

[FPV15] Vitaly Feldman, Will Perkins, and Santosh Vempala. On the complexity of random satisfia-
bility problems with planted solutions. pages 77–86, 2015.

[FR10] Uriel Feige and Dorit Ron. Finding hidden cliques in linear time. In 21st International Meet-
ing on Probabilistic, Combinatorial, and Asymptotic Methods in the Analysis of Algorithms
(AofA’10), pages 189–204. Discrete Mathematics and Theoretical Computer Science, 2010.

[FS10] Susana Faria and Gilda Soromenho. Fitting mixtures of linear regressions. Journal of Statis-
tical Computation and Simulation, 80(2):201–225, 2010.

[Gao20] Chao Gao. Robust regression via mutivariate regression depth. Bernoulli, 26(2):1139–1170,
2020.

[GCS+ 13] Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B
Rubin. Bayesian data analysis. Chapman and Hall/CRC, 2013.

[GKPV10] Shafi Goldwasser, Yael Kalai, Chris Peikert, and Vinod Vaikuntanathan. Robustness of the
learning with errors assumption. In Innovations in Computer Science, pages 230–240, 2010.

[GMZ17] Chao Gao, Zongming Ma, and Harrison H Zhou. Sparse cca: Adaptive estimation and com-
putational barriers. The Annals of Statistics, 45(5):2074–2101, 2017.

[Gol11] Oded Goldreich. Notes on Levin’s theory of average-case complexity. In Studies in Complex-
ity and Cryptography. Miscellanea on the Interplay between Randomness and Computation,
pages 233–247. Springer, 2011.

[Gri01] Dima Grigoriev. Linear lower bound on degrees of positivstellensatz calculus proofs for the
parity. Theoretical Computer Science, 259(1-2):613–622, 2001.

[GS+ 17] David Gamarnik, Madhu Sudan, et al. Limits of local algorithms over sparse random graphs.
The Annals of Probability, 45(4):2353–2376, 2017.

[GZ17] David Gamarnik and Ilias Zadik. High dimensional regression with binary coefficients. es-
timating squared error and a phase transtition. In Conference on Learning Theory, pages
948–953, 2017.

[GZ19] David Gamarnik and Ilias Zadik. The landscape of the planted clique problem: Dense sub-
graphs and the overlap gap property. arXiv preprint arXiv:1904.07174, 2019.

[HKP+ 17] Samuel B Hopkins, Pravesh K Kothari, Aaron Potechin, Prasad Raghavendra, Tselil
Schramm, and David Steurer. The power of sum-of-squares for detecting hidden structures.
Proceedings of the fifty-eighth IEEE Foundations of Computer Science, pages 720–731, 2017.

[HKP+ 18] Samuel B Hopkins, Pravesh Kothari, Aaron Henry Potechin, Prasad Raghavendra, and Tselil
Schramm. On the integrality gap of degree-4 sum of squares for planted clique. ACM Trans-
actions on Algorithms (TALG), 14(3):1–31, 2018.

[HL19] Samuel B Hopkins and Jerry Li. How hard is robust mean estimation? arXiv preprint
arXiv:1903.07870, 2019.

[HLL83] Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmod-
els: First steps. Social networks, 5(2):109–137, 1983.

[HLV18] Paul Hand, Oscar Leong, and Vlad Voroninski. Phase retrieval under a generative prior. In
Advances in Neural Information Processing Systems, pages 9136–9146, 2018.

[Hop18] Samuel B Hopkins. Statistical Inference and the Sum of Squares Method. PhD thesis, Cornell
University, 2018.

[HS17] Samuel B Hopkins and David Steurer. Efficient bayesian estimation from few samples: com-
munity detection and related problems. In Foundations of Computer Science (FOCS), 2017
IEEE 58th Annual Symposium on, pages 379–390. IEEE, 2017.

[HSS15] Samuel B Hopkins, Jonathan Shi, and David Steurer. Tensor principal component analysis
via sum-of-square proofs. In COLT, pages 956–1006, 2015.

[HSSS16] Samuel B Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral algo-
rithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In
Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 178–
191, 2016.

[Hub65] Peter J Huber. A robust version of the probability ratio test. The Annals of Mathematical
Statistics, pages 1753–1758, 1965.

[Hub92] Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics,
pages 492–518. Springer, 1992.

[Hub11] Peter J Huber. Robust statistics. Springer, 2011.

[HW20] Justin Holmgren and Alexander Wein. Counterexamples to the low-degree conjecture. arXiv
preprint arXiv:2004.08454, 2020.

[HWX15] Bruce E Hajek, Yihong Wu, and Jiaming Xu. Computational lower bounds for community
detection on random graphs. In COLT, pages 899–928, 2015.

[HWX16a] Bruce Hajek, Yihong Wu, and Jiaming Xu. Achieving exact cluster recovery threshold via
semidefinite programming. IEEE Transactions on Information Theory, 62(5):2788–2797,
2016.

[HWX16b] Bruce Hajek, Yihong Wu, and Jiaming Xu. Achieving exact cluster recovery threshold
via semidefinite programming: Extensions. IEEE Transactions on Information Theory,
62(10):5918–5937, 2016.

[HWX16c] Bruce Hajek, Yihong Wu, and Jiaming Xu. Information limits for recovering a hidden com-
munity. pages 1894–1898, 2016.

[JJ94] Michael I Jordan and Robert A Jacobs. Hierarchical mixtures of experts and the em algorithm.
Neural computation, 6(2):181–214, 1994.

[JL04] Iain M Johnstone and Arthur Yu Lu. Sparse principal components analysis. Unpublished
manuscript, 2004.

[JL15] Tiefeng Jiang and Danning Li. Approximation of rectangular beta-laguerre ensembles and
large deviations. Journal of Theoretical Probability, 28(3):804–847, 2015.

[JLM18] Aukosh Jagannath, Patrick Lopatto, and Leo Miolane. Statistical thresholds for tensor pca.
arXiv preprint arXiv:1812.03403, 2018.

[JM15] Michael I Jordan and Tom M Mitchell. Machine learning: Trends, perspectives, and
prospects. Science, 349(6245):255–260, 2015.

[Kar77] Richard M Karp. Probabilistic analysis of partitioning algorithms for the traveling-salesman
problem in the plane. Mathematics of operations research, 2(3):209–224, 1977.

[KKK19] Sushrut Karmalkar, Adam Klivans, and Pravesh Kothari. List-decodable linear regression. In
Advances in Neural Information Processing Systems, pages 7423–7432, 2019.

[KKM18] Adam Klivans, Pravesh K Kothari, and Raghu Meka. Efficient algorithms for outlier-robust
regression. In Conference On Learning Theory, pages 1420–1430, 2018.

[KMM11] Alexandra Kolla, Konstantin Makarychev, and Yury Makarychev. How to play unique games
against a semi-random adversary: Study of semi-random models of unique games. In 2011
IEEE 52nd Annual Symposium on Foundations of Computer Science, pages 443–452. IEEE,
2011.

[KMMP19] Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, and Soumyabrata Pal. Sample
complexity of learning mixture of sparse linear regressions. In Advances in Neural Informa-
tion Processing Systems, pages 10531–10540, 2019.

[KMOW17] Pravesh K Kothari, Ryuhei Mori, Ryan O’Donnell, and David Witmer. Sum of squares lower
bounds for refuting any csp. arXiv preprint arXiv:1701.04521, 2017.

[KMRT+ 07] Florent Krzakała, Andrea Montanari, Federico Ricci-Tersenghi, Guilhem Semerjian, and
Lenka Zdeborová. Gibbs states and the set of solutions of random constraint satisfaction
problems. Proceedings of the National Academy of Sciences, 104(25):10318–10323, 2007.

[KR19] Yael Tauman Kalai and Leonid Reyzin. A survey of leakage-resilient cryptography. In Pro-
viding Sound Foundations for Cryptography: On the Work of Shafi Goldwasser and Silvio
Micali, pages 727–794. 2019.

[Kuč77] L Kučera. Expected behavior of graph coloring algorithms. In International Conference on
Fundamentals of Computation Theory, pages 447–451. Springer, 1977.

[KWB19] Dmitriy Kunisky, Alexander S Wein, and Afonso S Bandeira. Notes on computational hard-
ness of hypothesis testing: Predictions using the low-degree likelihood ratio. arXiv preprint
arXiv:1907.11636, 2019.

[KZ14] Pascal Koiran and Anastasios Zouzias. Hidden cliques and the certification of the restricted
isometry property. IEEE Transactions on Information Theory, 60(8):4999–5006, 2014.

[LDBB+ 16] Thibault Lesieur, Caterina De Bacco, Jess Banks, Florent Krzakala, Cris Moore, and Lenka
Zdeborová. Phase transitions and optimal algorithms in high-dimensional Gaussian mixture
clustering. In Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton
Conference on, pages 601–608. IEEE, 2016.

[Lev86] Leonid A Levin. Average case complete problems. SIAM Journal on Computing, 15(1):285–
286, 1986.

[Li17] Jerry Li. Robust sparse estimation tasks in high dimensions. arXiv preprint
arXiv:1702.05860, 2017.

[Lin92] Nathan Linial. Locality in distributed graph algorithms. SIAM Journal on Computing,
21(1):193–201, 1992.

[LKZ15] Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. Mmse of probabilistic low-rank
matrix estimation: Universality with respect to the output channel. In Communication, Con-
trol, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, pages 680–687.
IEEE, 2015.

[LL18] Yuanzhi Li and Yingyu Liang. Learning mixtures of linear regressions with nearly optimal
complexity. In Conference On Learning Theory, pages 1125–1144, 2018.

[LLC19] Liu Liu, Tianyang Li, and Constantine Caramanis. High dimensional robust estimation of
sparse models via trimmed hard thresholding. arXiv preprint arXiv:1901.08237, 2019.

[LLV17] Can M Le, Elizaveta Levina, and Roman Vershynin. Concentration and regularization of
random graphs. Random Structures & Algorithms, 51(3):538–561, 2017.

[LML+ 17] Thibault Lesieur, Léo Miolane, Marc Lelarge, Florent Krzakala, and Lenka Zdeborová. Sta-
tistical and computational phase transitions in spiked tensor estimation. In 2017 IEEE Inter-
national Symposium on Information Theory (ISIT), pages 511–515. IEEE, 2017.

[LP13] Linyuan Lu and Xing Peng. Spectra of edge-independent random graphs. The Electronic
Journal of Combinatorics, 20(4):P27, 2013.

[LRV16] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covari-
ance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS),
pages 665–674. IEEE, 2016.

[LSLC18] Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis. High dimensional robust
sparse regression. arXiv preprint arXiv:1805.11643, 2018.

[LV13] Xiaodong Li and Vladislav Voroninski. Sparse signal recovery from quadratic measurements
via convex programming. SIAM Journal on Mathematical Analysis, 45(5):3019–3033, 2013.

[Maj09] A Majumdar. Image compression by sparse pca coding in curvelet domain. Signal, image
and video processing, 3(1):27–34, 2009.

[Mas14] Laurent Massoulié. Community detection thresholds and the weak ramanujan property. In
Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 694–
703. ACM, 2014.

[MCMM09] Cathy Maugis, Gilles Celeux, and Marie-Laure Martin-Magniette. Variable selection for
clustering with gaussian mixture models. Biometrics, 65(3):701–709, 2009.

[McS01] Frank McSherry. Spectral partitioning of random graphs. In Foundations of Computer Sci-
ence, 2001. Proceedings. 42nd IEEE Symposium on, pages 529–537. IEEE, 2001.

[MKB79] K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate analysis. Academic Press, London,
1979.

[MM11] Cathy Maugis and Bertrand Michel. A non asymptotic penalized criterion for gaussian mix-
ture model selection. ESAIM: Probability and Statistics, 15:41–68, 2011.

[MM18a] Marco Mondelli and Andrea Montanari. Fundamental limits of weak recovery with applica-
tions to phase retrieval. In Conference On Learning Theory, pages 1445–1450, 2018.

[MM18b] Marco Mondelli and Andrea Montanari. On the connection between learning two-layers
neural networks and tensor decomposition. arXiv preprint arXiv:1802.07301, 2018.

[MMV12] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Approximation
algorithms for semi-random partitioning problems. In Proceedings of the forty-fourth annual
ACM symposium on Theory of computing, pages 367–384. ACM, 2012.

[MMV15] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Correlation clus-
tering with noisy partial information. In Conference on Learning Theory, pages 1321–1342,
2015.

[MNS15] Elchanan Mossel, Joe Neeman, and Allan Sly. Reconstruction and estimation in the planted
partition model. Probability Theory and Related Fields, 162(3-4):431–461, 2015.

[MNS18] Elchanan Mossel, Joe Neeman, and Allan Sly. A proof of the block model threshold conjec-
ture. Combinatorica, 38(3):665–708, 2018.

[Mon15] Andrea Montanari. Finding one community in a sparse graph. Journal of Statistical Physics,
161(2):273–299, 2015.

[Moo17] Cristopher Moore. The computer science and physics of community detection: Landscapes,
phase transitions, and hardness. arXiv preprint arXiv:1702.00467, 2017.

[MP04] Geoffrey J McLachlan and David Peel. Finite mixture models. John Wiley & Sons, 2004.

[MPW15] Raghu Meka, Aaron Potechin, and Avi Wigderson. Sum-of-squares lower bounds for planted
clique. pages 87–96, 2015.

[MPW16] Ankur Moitra, William Perry, and Alexander S Wein. How robust are reconstruction thresh-
olds for community detection? In Proceedings of the forty-eighth annual ACM symposium
on Theory of Computing, pages 828–841. ACM, 2016.

[MRX19] Sidhanth Mohanty, Prasad Raghavendra, and Jeff Xu. Lifting sum-of-squares lower bounds:
Degree-2 to degree-4. arXiv preprint arXiv:1911.01411, 2019.

[MS10] Claire Mathieu and Warren Schudy. Correlation clustering with noisy input. In Proceedings
of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 712–728.
Society for Industrial and Applied Mathematics, 2010.

[MW15a] Tengyu Ma and Avi Wigderson. Sum-of-squares lower bounds for sparse pca. In Advances
in Neural Information Processing Systems, pages 1612–1620, 2015.

[MW15b] Zongming Ma and Yihong Wu. Computational barriers in minimax submatrix detection. The
Annals of Statistics, 43(3):1089–1116, 2015.

[MWFSG16] Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter, and Bettina Grün. Model-based clus-
tering based on sparse finite gaussian mixtures. Statistics and computing, 26(1-2):303–324,
2016.

[NN12] Raj Rao Nadakuditi and Mark EJ Newman. Graph spectra and the detectability of community
structure in networks. Physical review letters, 108(18):188701, 2012.

[NN14] Joe Neeman and Praneeth Netrapalli. Non-reconstructability in the stochastic block model.
arXiv preprint arXiv:1404.6304, 2014.

[PS07] Wei Pan and Xiaotong Shen. Penalized model-based clustering with application to variable
selection. Journal of Machine Learning Research, 8(May):1145–1164, 2007.

[PSBR18] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust
estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

[PTB09] Elena Parkhomenko, David Tritchler, and Joseph Beyene. Sparse canonical correlation anal-
ysis with application to genomic data integration. Statistical applications in genetics and
molecular biology, 8(1):1–34, 2009.

[PW17] Amelia Perry and Alexander S Wein. A semidefinite program for unbalanced multisection
in the stochastic block model. In 2017 International Conference on Sampling Theory and
Applications (SampTA), pages 64–67. IEEE, 2017.

[PWB+ 20] Amelia Perry, Alexander S Wein, Afonso S Bandeira, et al. Statistical limits of spiked tensor
models. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 56,
pages 230–264. Institut Henri Poincaré, 2020.

[QR78] Richard E Quandt and James B Ramsey. Estimating mixtures of normal distributions and
switching regressions. Journal of the American statistical Association, 73(364):730–738,
1978.

[RBBC19] Valentina Ros, Gerard Ben Arous, Giulio Biroli, and Chiara Cammarota. Complex energy
landscapes in spiked-tensor and simple glassy models: Ruggedness, arrangements of local
minima, and phase transitions. Physical Review X, 9(1):011003, 2019.

[RCY+ 11] Karl Rohe, Sourav Chatterjee, Bin Yu, et al. Spectral clustering and the high-dimensional
stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915, 2011.

[RD06] Adrian E Raftery and Nema Dean. Variable selection for model-based clustering. Journal of
the American Statistical Association, 101(473):168–178, 2006.

[Rem13] Reinhold Remmert. Classical topics in complex function theory, volume 172. Springer Sci-
ence & Business Media, 2013.

[RL05] Peter J Rousseeuw and Annick M Leroy. Robust regression and outlier detection, volume
589. John wiley & sons, 2005.

[RM14] Emile Richard and Andrea Montanari. A statistical model for tensor pca. In Advances in
Neural Information Processing Systems, pages 2897–2905, 2014.

[Ros08] Benjamin Rossman. On the constant-depth complexity of k-clique. In Proceedings of the
fortieth annual ACM symposium on Theory of computing, pages 721–730. ACM, 2008.

[Ros14] Benjamin Rossman. The monotone complexity of k-clique on random graphs. SIAM Journal
on Computing, 43(1):256–279, 2014.

[RR97] Alexander A Razborov and Steven Rudich. Natural proofs. Journal of Computer and System
Sciences, 55(1):24–35, 1997.

[RR19] Miklós Z Rácz and Jacob Richey. A smooth transition from wishart to goe. Journal of
Theoretical Probability, 32(2):898–906, 2019.

[RSS18] Prasad Raghavendra, Tselil Schramm, and David Steurer. High-dimensional estimation via
sum-of-squares proofs. arXiv preprint arXiv:1807.11419, 6, 2018.

[RTSZ19] Federico Ricci-Tersenghi, Guilhem Semerjian, and Lenka Zdeborová. Typology of phase
transitions in Bayesian inference problems. Physical Review E, 99(4):042109, 2019.

[RWY10] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for
correlated gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259,
2010.

[RY20] Prasad Raghavendra and Morris Yau. List decodable learning via sum of squares. In Pro-
ceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages
161–180. SIAM, 2020.

[SBVDG10] Nicolas Städler, Peter Bühlmann, and Sara Van De Geer. `1 -penalization for mixture regres-
sion models. Test, 19(2):209–256, 2010.

[Ser99] Rocco A Servedio. Computational sample complexity and attribute-efficient learning. In
Proceedings of the thirty-first annual ACM symposium on Theory of Computing, pages 701–
710, 1999.

[SR14] Philip Schniter and Sundeep Rangan. Compressive phase retrieval via generalized approxi-
mate message passing. IEEE Transactions on Signal Processing, 63(4):1043–1055, 2014.

[Tib96] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

[Tuk75] John W Tukey. Mathematics and the picturing of data. In Proceedings of the International
Congress of Mathematicians, Vancouver, 1975, volume 2, pages 523–531, 1975.

[VA18] Aravindan Vijayaraghavan and Pranjal Awasthi. Clustering semi-random mixtures of gaus-
sians. In International Conference on Machine Learning, pages 5055–5064, 2018.

[VAC15] Nicolas Verzelen and Ery Arias-Castro. Community detection in sparse random networks.
The Annals of Applied Probability, 25(6):3465–3510, 2015.

[VAC17] Nicolas Verzelen and Ery Arias-Castro. Detection and feature selection in sparse mixture
models. The Annals of Statistics, 45(5):1920–1950, 2017.

[Val84] Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142,
1984.

[Vu05] Van H Vu. Spectral norm of random matrices. In Proceedings of the thirty-seventh annual
ACM symposium on Theory of computing, pages 423–430. ACM, 2005.

[WBP16] Tengyao Wang, Quentin Berthet, and Yaniv Plan. Average-case hardness of rip certification.
In Advances in Neural Information Processing Systems, pages 3819–3827, 2016.

[WBS16] Tengyao Wang, Quentin Berthet, and Richard J Samworth. Statistical and computational
trade-offs in estimation of sparse principal components. The Annals of Statistics, 44(5):1896–
1930, 2016.

[WD95] Michel Wedel and Wayne S DeSarbo. A mixture likelihood approach for generalized linear
models. Journal of classification, 12(1):21–55, 1995.

[WEAM19] Alexander S Wein, Ahmed El Alaoui, and Cristopher Moore. The kikuchi hierarchy and
tensor pca. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science
(FOCS), pages 1446–1468. IEEE, 2019.

[WGNL14] Zhaoran Wang, Quanquan Gu, Yang Ning, and Han Liu. High dimensional expectation-
maximization algorithm: Statistical optimization and asymptotic normality. arXiv preprint
arXiv:1412.8729, 2014.

[Wis28] John Wishart. The generalised product moment distribution in samples from a normal multi-
variate population. Biometrika, pages 32–52, 1928.

[WLY13] Dong Wang, Huchuan Lu, and Ming-Hsuan Yang. Online object tracking with sparse proto-
types. IEEE transactions on image processing, 22(1):314–325, 2013.

[WX18] Yihong Wu and Jiaming Xu. Statistical problems with planted structures: Information-
theoretical and computational limits. arXiv preprint arXiv:1806.00118, 2018.

[WZG+ 17] Gang Wang, Liang Zhang, Georgios B Giannakis, Mehmet Akçakaya, and Jie Chen. Sparse
phase retrieval via truncated amplitude flow. IEEE Transactions on Signal Processing,
66(2):479–491, 2017.

[Yat85] Yannis G Yatracos. Rates of convergence of minimum distance estimators and kolmogorov’s
entropy. The Annals of Statistics, pages 768–774, 1985.

[YC15] Xinyang Yi and Constantine Caramanis. Regularized em algorithms: A unified framework
and statistical guarantees. In Advances in Neural Information Processing Systems, pages
1567–1575, 2015.

[YCS14] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating minimization for
mixed linear regression. In International Conference on Machine Learning, pages 613–621,
2014.

[YPCR18] Dong Yin, Ramtin Pedarsani, Yudong Chen, and Kannan Ramchandran. Learning mixtures
of sparse linear regressions using sparse graph codes. IEEE Transactions on Information
Theory, 65(3):1430–1451, 2018.

[ZHT06] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal
of computational and graphical statistics, 15(2):265–286, 2006.

[ZK16] Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: Thresholds and
algorithms. Advances in Physics, 65(5):453–552, 2016.

[ZWJ14] Yuchen Zhang, Martin J Wainwright, and Michael I Jordan. Lower bounds on the perfor-
mance of polynomial-time algorithms for sparse linear regression. In COLT, pages 921–948,
2014.

[ZX17] Anru Zhang and Dong Xia. Tensor svd: Statistical and computational limits. arXiv preprint
arXiv:1703.02724, 2017.

[ZZ04] Hong-Tu Zhu and Heping Zhang. Hypothesis testing in mixture regression models. Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 66(1):3–16, 2004.

Part IV
Appendix
A Deferred Proofs from Part II
A.1 Proofs of Total Variation Properties
In this section, we present several deferred proofs from Sections 6.2 and 17. We first prove Lemma 6.3.

Proof of Lemma 6.3. This follows from a simple induction on m. Note that the case when m = 1 follows
by definition. Now observe that by the data-processing and triangle inequalities of total variation, we have
that if B = Am−1 ◦ Am−2 ◦ · · · ◦ A1 then

\[
d_{\mathrm{TV}}(A(P_0), P_m) \le d_{\mathrm{TV}}(A_m \circ B(P_0), A_m(P_{m-1})) + d_{\mathrm{TV}}(A_m(P_{m-1}), P_m) \le d_{\mathrm{TV}}(B(P_0), P_{m-1}) + \epsilon_m \le \sum_{i=1}^{m} \epsilon_i
\]

where the last inequality follows from the induction hypothesis applied with m − 1 to B. This completes
the induction and proves the lemma.

We now prove Lemma 6.4 upper bounding the total variation distance between vectors of unplanted and
planted samples from binomial distributions.

Proof of Lemma 6.4. Given some P ∈ [0, 1], we begin by computing χ²(Bern(P) + Bin(m − 1, Q), Bin(m, Q)).
For notational convenience, let \binom{a}{b} = 0 if b > a or b < 0. It follows that
\[
1 + \chi^2\left( \mathrm{Bern}(P) + \mathrm{Bin}(m-1, Q), \mathrm{Bin}(m, Q) \right) = \sum_{t=0}^{m} \frac{\left( (1-P) \cdot \binom{m-1}{t} Q^t (1-Q)^{m-1-t} + P \cdot \binom{m-1}{t-1} Q^{t-1} (1-Q)^{m-t} \right)^2}{\binom{m}{t} Q^t (1-Q)^{m-t}}
\]
\[
= \sum_{t=0}^{m} \binom{m}{t} Q^t (1-Q)^{m-t} \cdot \left( \frac{m-t}{m} \cdot \frac{1-P}{1-Q} + \frac{t}{m} \cdot \frac{P}{Q} \right)^2
= \mathbb{E}\left[ \left( \frac{m-X}{m} \cdot \frac{1-P}{1-Q} + \frac{X}{m} \cdot \frac{P}{Q} \right)^2 \right]
\]
\[
= \mathbb{E}\left[ \left( 1 + \frac{X - mQ}{m} \cdot \frac{P-Q}{Q(1-Q)} \right)^2 \right]
= 1 + \frac{2(P-Q)}{mQ(1-Q)} \cdot \mathbb{E}[X - mQ] + \frac{(P-Q)^2}{m^2 Q^2 (1-Q)^2} \cdot \mathbb{E}\left[ (X - Qm)^2 \right]
= 1 + \frac{(P-Q)^2}{mQ(1-Q)}
\]
where X ∼ Bin(m, Q) and the final equality follows from E[X] = Qm and E[(X − Qm)²] =
Var[X] = Q(1 − Q)m. The concavity of log implies that d_KL(P, Q) ≤ log(1 + χ²(P, Q)) ≤ χ²(P, Q) for
any two distributions with P absolutely continuous with respect to Q. Pinsker's inequality and tensorization
of d_KL now imply that
\[
2 \cdot d_{\mathrm{TV}}\left( \otimes_{i=1}^{k} \left( \mathrm{Bern}(P_i) + \mathrm{Bin}(m-1, Q) \right), \mathrm{Bin}(m, Q)^{\otimes k} \right)^2 \le d_{\mathrm{KL}}\left( \otimes_{i=1}^{k} \left( \mathrm{Bern}(P_i) + \mathrm{Bin}(m-1, Q) \right), \mathrm{Bin}(m, Q)^{\otimes k} \right)
\]
\[
= \sum_{i=1}^{k} d_{\mathrm{KL}}\left( \mathrm{Bern}(P_i) + \mathrm{Bin}(m-1, Q), \mathrm{Bin}(m, Q) \right) \le \sum_{i=1}^{k} \chi^2\left( \mathrm{Bern}(P_i) + \mathrm{Bin}(m-1, Q), \mathrm{Bin}(m, Q) \right) = \sum_{i=1}^{k} \frac{(P_i - Q)^2}{mQ(1-Q)}
\]

which completes the proof of the lemma.
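The closed form for the chi-squared divergence derived above can also be checked numerically. The following sketch (ours) computes χ²(Bern(P) + Bin(m − 1, Q), Bin(m, Q)) exactly from the probability mass functions and compares it with (P − Q)²/(mQ(1 − Q)); the parameter values are arbitrary.

    from math import comb

    def binom_pmf(t, n, p):
        return comb(n, t) * p ** t * (1 - p) ** (n - t)

    P, Q, m = 0.7, 0.3, 12

    def planted_pmf(t):
        # pmf of Bern(P) + Bin(m - 1, Q) at t, by conditioning on the Bernoulli
        out = 0.0
        if 0 <= t <= m - 1:
            out += (1 - P) * binom_pmf(t, m - 1, Q)
        if 1 <= t <= m:
            out += P * binom_pmf(t - 1, m - 1, Q)
        return out

    chi_sq = sum((planted_pmf(t) - binom_pmf(t, m, Q)) ** 2 / binom_pmf(t, m, Q)
                 for t in range(m + 1))
    print(chi_sq, (P - Q) ** 2 / (m * Q * (1 - Q)))  # the two printed values agree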

We now prove Lemma 6.5 on the total variation distance between two binomial distributions.

Proof of Lemma 6.5. By applying the data processing inequality for d_TV to the function taking the sum of the
coordinates of a vector, we have that
\[
2 \cdot d_{\mathrm{TV}}\left( \mathrm{Bin}(n, P), \mathrm{Bin}(n, Q) \right)^2 \le 2 \cdot d_{\mathrm{TV}}\left( \mathrm{Bern}(P)^{\otimes n}, \mathrm{Bern}(Q)^{\otimes n} \right)^2 \le d_{\mathrm{KL}}\left( \mathrm{Bern}(P)^{\otimes n}, \mathrm{Bern}(Q)^{\otimes n} \right) = n \cdot d_{\mathrm{KL}}\left( \mathrm{Bern}(P), \mathrm{Bern}(Q) \right) \le n \cdot \chi^2\left( \mathrm{Bern}(P), \mathrm{Bern}(Q) \right) = n \cdot \frac{(P-Q)^2}{Q(1-Q)}
\]
The second inequality is an application of Pinsker’s, the first equality is tensorization of dKL and the third
inequality is the fact that χ2 upper bounds dKL by the concavity of log. This completes the proof of the
lemma.

We conclude this section with a proof of Lemma 17.2, establishing the key property of reductions in
total variation among recovery problems.

Proof of Lemma 17.2. As in the proof of Lemma 6.1 from [BBH18], this lemma follows from a simple
application of the definition of dTV . Suppose that there is such an E 0 . Now consider the algorithm E that
proceeds as follows on an input X of P(n, τ ):
1. compute A(X) and the output θˆ0 of E 0 on input A(X); and
2. output the result θ̂ ← B(X, θˆ0 ).
Suppose that X ∼ PD (θ) for some θ ∈ ΘP . Consider a coupling of X, the randomness of A and
Y ∼ Eθ0 ∼D(θ) PD 0 (θ 0 ) such that P[A(X) 6= Y ] = o (1). Since Y is distributed as a mixture of P 0 (θ 0 ),
n D
conditioned on θ , it holds that E 0 succeeds with probability
0
h i
0 0 0 0
P `P 0 (E (Y ), θ ) ≤ τ θ ≥ p

Marginalizing this over θ0 yields that P [`P 0 (E 0 (Y ), θ0 ) ≤ τ 0 for some θ0 ∈ supp D(θ)] ≥ p. Now since
A(X) = Y is a probability 1 − on (1) event, we have that the intersection of this and the event above occurs
with probability p − on (1). Therefore
h i
P `P 0 (θ0 , θˆ0 ) ≤ τ 0 for some θ0 ∈ supp D(θ) ≥ P A(X) = Y and E 0 succeeds ≥ p − on (1)
 

162
Now note that the definition of B implies that
h i h i
P `P (θ, θ̂) ≤ τ ≥ P `P 0 (θ0 , θˆ0 ) ≤ τ 0 for some θ0 ∈ supp D(θ) and B succeeds
h i
≥ P `P 0 (θ0 , θˆ0 ) ≤ τ 0 for some θ0 ∈ supp D(θ) − P [B fails]
≥ p − on (1)

which completes the proof of the lemma.

A.2 Proofs for To-k-Partite-Submatrix


In this section, we prove Lemma 7.5, which establishes the approximate Markov transition properties of the
reduction T O -k-PARTITE -S UBMATRIX. We first establish analogue of Lemma 6.4 from [BBH19] in the
k-partite case to analyze the planted diagonal entries in Step 2 of T O -k-PARTITE -S UBMATRIX.
 
P
Lemma A.1 (Planting k-Partite Diagonals). Suppose that 0 < Q < P ≤ 1 and n ≥ Q + 1 N is such
that both N and n are divisible by k and k ≤ QN/4. Suppose that for each t ∈ [k],

z1t ∼ Bern(P ), z2t ∼ Bin(N/k − 1, P ) and z3t ∼ Bin(n/k, Q)

are independent. If z4t = max{z3t − z1t − z2t , 0}, then it follows that
 r
Q2 N 2 CQ k 2
  
⊗k
dTV ⊗kt=1 L(z1t , z2t
+ z4t ), (Bern(P )
⊗ Bin(n/k − 1, Q)) ≤ 4k · exp − +
48P kn 2n
2 2
 
  Q N
dTV ⊗kt=1 L(z1t + z2t + z4t ), Bin(n/k, Q)⊗k ≤ 4k · exp −
48P kn
n o
Q
where CQ = max 1−Q , 1−Q
Q .

Proof. Throughout this argument, let v denote a vector in {0, 1}k . Now define the event
k
\
z3t = z1t + z2t + z4t

E=
t=1

Now observe that if z3t ≥ Qn/k − QN/2k + 1 and z2t ≤ P (N/k − 1) + QN/2k then it follows that
z3t ≥ 1 + z2t ≥ vt + z2t for any vt ∈ {0, 1} since Qn ≥ (P + Q)N . Now union bounding the probability
that E does not hold conditioned on z1 yields that

h i Xk
C
P z3t < vt + z2t
 
P E z1 = v ≤
t=1
k   X k    
X Qn QN N QN
≤ P < z3t − +1 + t
P z2 > P −1 +
k 2k k 2k
t=1 t=1
! !
(QN/2k − 1)2 (QN/2k)2
≤ k · exp − + k · exp −
3Qn/k 2P (N/k − 1)
Q2 N 2
 
≤ 2k · exp −
48P kn

163
where the third inequality follows from standard Chernoff bounds on the tails of the binomial distribution.
Marginalizing this bound over v ∼ L(z1 ) = Bern(P )⊗k , we have that
Q2 N 2
h i  
 C C
P E = Ev∼L(z1 ) P E z1 = v ≤ 2k · exp −
48P kn
Now consider the total variation error induced by conditioning each of the product measures ⊗kt=1 L(z1t +
z2t + z4t ) and ⊗kt=1 L(z3t ) on the event E. Note that under E, by definition, we have that z3t = z1t + z2t + z4t
for each t ∈ [k]. By the conditioning property of dTV in Fact 6.2, we have
   
dTV ⊗kt=1 L(z1t + z2t + z4t ), L z3t : t ∈ [k] E ≤ P EC
 
   
dTV ⊗kt=1 L(z3t ), L z3t : t ∈ [k] E ≤ P EC
 

The fact that ⊗kt=1 L(z3t ) = Bin(n/k, Q)⊗k and the triangle inequality now imply that
Q2 N 2
   
k t t t ⊗k
 C
dTV ⊗t=1 L(z1 + z2 + z4 ), Bin(n/k, Q) ≤ 2 · P E ≤ 4k · exp −
48P kn
which proves the second inequality in the statement of the lemma. It suffices to establish the first inequality.
A similar conditioning step as above shows that for all v ∈ {0, 1}k , we have that
      h i
dTV ⊗kt=1 L vt + z2t + z4t z1t = vt , L vt + z2t + z4t : t ∈ [k] z1 = v and E ≤ P E C z1 = v

      h i
dTV ⊗kt=1 L z3t z1t = vt , L z3t : t ∈ [k] z1 = v and E ≤ P E C z1 = v

The triangle inequality and the fact that z3 ∼ Bin(n/k, Q)⊗k is independent of z1 implies that
Q2 N 2
     
k t t t ⊗k
dTV ⊗t=1 L vt + z2 + z4 z1 = vt , Bin(n/k, Q) ≤ 4k · exp −
48P kn
By Lemma 6.4 applied with Pt = vt ∈ {0, 1}, we also have that
v
u k r
  uX k(vt − Q)2 CQ k 2
dTV ⊗kt=1 (vt + Bin(n/k − 1, Q)) , Bin(n/k, Q)⊗k ≤t ≤
2nQ(1 − Q) 2n
t=1

The triangle now implies that for each v ∈ {0, 1}k ,


   
dTV ⊗kt=1 L z2t + z4t z1t = vt , Bin(n/k − 1, Q)⊗k

   
= dTV ⊗kt=1 L vt + z2t + z4t z1t = vt , ⊗kt=1 (vt + Bin(n/k − 1, Q))

 r
Q2 N 2 CQ k 2

≤ 4k · exp − +
48P kn 2n
We now marginalize over v ∼ L(z1 ) = Bern(P )⊗k . The conditioning on a random variable property of dTV
in Fact 6.2 implies that
 
dTV ⊗kt=1 L(z1t , z2t + z4t ), (Bern(P ) ⊗ Bin(n/k − 1, Q))⊗k
   
k t t t ⊗k
≤ Ev∼Bern(P )⊗k dTV ⊗t=1 L z2 + z4 z1 = vt , Bin(n/k − 1, Q)

which, when combined with the inequalities above, completes the proof of the lemma.

164
We now apply this lemma to prove Lemma 7.5. The proof of this lemma is a k-partite variant of the
argument used to prove Theorem 6.1 in [BBH19]. However, it involves several technical subtleties that do
not arise in the non k-partite case.

Proof of Lemma 7.5. Fix some subset R ⊆ [N ] such that |R ∩ Ei | = 1 for each i ∈ [k]. We will first
show that A maps an input G ∼ G(N, R, p, q) approximately in total variation to a sample from the planted
submatrix distribution M[n]×[n] (Un (F ), Bern(p), Bern(Q)). By AM-GM, we have that

√ p+q (1 − p) + (1 − q) p
pq ≤ =1− ≤ 1 − (1 − p)(1 − q)
2 2
 2
If p 6= 1, it follows that P = p > Q = 1 − (1 − p)(1 − q). This implies that 1−p 1−P
p
1−q = 1−Q and
 2 √   2
the inequality above rearranges to Q P
≤ pq . If p = 1, then Q = q and Q P
= pq . Furthermore,
 2
the inequality 1−p
1−q ≤ 1−P
1−Q holds trivially. Therefore we may apply Lemma 7.3, which implies that
(G1 , G2 ) ∼ G(N, R, p, Q)⊗2 .
Let the random set U = {π1−1 (R ∩ E1 ), π2−1 (R ∩ E2 ), . . . , πk−1 (R ∩ Ek )} denote the support of the
k-subset of [n] that R is mapped to in the embedding step of T O -k-PARTITE -S UBMATRIX. Now fix some
k-subset R0 ⊆ [n] with |R0 ∩ Fi | = 1 for each i ∈ [k] and consider the distribution of MPD conditioned on
the event U = R0 . Since (G1 , G2 ) ∼ G(n, R, p, Q)⊗2 , Step 2 of T O -k-PARTITE -S UBMATRIX ensures that
the off-diagonal entries of MPD , given this conditioning, are independent and distributed as follows:

• Mij ∼ Bern(p) if i 6= j and i, j ∈ R0 ; and

• Mij ∼ Bern(Q) if i 6= j and i 6∈ R0 or j 6∈ R0 .

which match the corresponding entries of M[n]×[n] (R0 × R0 , Bern(p), Bern(Q)). Furthermore, these entries
are independent of the vector diag(MPD ) = ((MPD )ii : i ∈ [k]) of the diagonal entries of MPD . It therefore
follows that
   
dTV L MPD U = R0 , M[n]×[n] R0 × R0 , Bern(p), Bern(Q)

   
= dTV L diag(MPD ) U = R0 , M[n] R0 , Bern(p), Bern(Q)

Let (S10 , S20 , . . . , Sk0 ) be any tuple of fixed subsets such that |St0 | = N/k, Si0 ⊆ Ft and R0 ∩ Ft ∈ St0 for each
t ∈ [k]. Now consider the distribution of diag(MPD ) conditioned on both U = R0 and (S1 , S2 , . . . , Sk ) =
(S10 , S20 , . . . , Sk0 ). It holds by construction that the k vectors diag(MPD )Ft are independent for t ∈ [k] and
each distributed as follows:

• diag(MPD )St0 is an exchangeable distribution on {0, 1}N/k with support of size st1 ∼ Bin(N/k, p),
by construction. This implies that diag(MPD )St0 ∼ Bern(p)⊗N/k . This can trivially be restated as
 
MR0 ∩Ft ,R0 ∩Ft , diag(MPD )St0 \R0 ∼ Bern(p) ⊗ Bern(p)⊗N/k−1 .

• diag(MPD )Ft \St0 is an exchangeable distribution on {0, 1}N/k with support of size z4t = max{st2 −
st1 , 0}. Furthermore, diag(MPD )Ft \St0 is independent of diag(MPD )St0 .

For each t ∈ [k], let z1t = MR0 ∩Ft ,R0 ∩Ft ∼ Bern(p) and z2t ∼ Bin(N/k − 1, p) be the size of the support of
diag(MPD )St0 \R0 . As shown discussed in the first point above, we have that z1t and z2t are independent and
z1t + z2t = st1 .

165
Now consider the distribution of diag(MPD ) relaxed to only be conditioned on U = R0 , and no longer on
(S1 , S2 , . . . , Sk ) = (S10 , S20 , . . . , Sk0 ). Conditioned on U = R0 , the St are independent and each uniformly
distributed among all N/k size subsets of Ft that contain the element R0 ∩ Ft . In particular, this implies that
the distribution of diag(MPD )Ft \R0 is an exchangeable distribution on {0, 1}n/k−1 with support size z2t + z4t
for each t. Note that any v ∼ M[n] (R0 , Bern(p), Bern(Q)) also satisfies that vFt \R0 is exchangeable. This
implies that M[n] (R0 , Bern(p), Bern(Q)) and diag(MPD ) are identically distributed when conditioned on
their entries with indices in R0 and on their support sizes within the k sets of indices Ft \R0 . The conditioning
property of Fact 6.2 therefore implies that
   
dTV L diag(MPD ) U = R0 , M[n] R0 , Bern(p), Bern(Q)

 
≤ dTV ⊗kt=1 L(z1t , z2t + z4t ), (Bern(p) ⊗ Bin(n/k − 1, Q))⊗k
 r
Q2 N 2 CQ k 2

≤ 4k · exp − +
48P kn 2n

by the first inequality in Lemma A.1. Now observe that U ∼ Un (F ) and thus marginalizing over R0 ∼
L(U ) = Un (F ) and applying the conditioning property of Fact 6.2 yields that

dTV A(G(N, R, p, q)), M[n]×[n] (Un (F ), Bern(p), Bern(Q))
   
≤ ER0 ∼Un (F ) dTV L MPD U = R0 , M[n]×[n] R0 × R0 , Bern(p), Bern(Q)

since MPD ∼ A(G(N, R, p, q)). Applying an identical marginalization over R ∼ UN (E) completes the
proof of the first inequality in the lemma statement.
It suffices to consider the case where G ∼ G(N, q), which follows from an analogous but simpler
argument. By Lemma 7.3, we have that (G1 , G2 ) ∼ G(N, Q)⊗2 . It follows that the entries of MPD are
distributed as (MPD )ij ∼i.i.d. Bern(Q) for all i 6= j independently of diag(MPD ). Now note that the k vectors
diag(MPD )Ft for t ∈ [k] are each exchangeable and have support size st1 + max{st2 − st1 , 0} = z1t + z2t + z4t
where z1t ∼ Bern(p), z2t ∼ Bin(N/k − 1, p) and st2 ∼ Bin(n/k, Q) are independent. By the same argument
as above, we have that

dTV L(MPD ), Bern(Q)⊗n×n = dTV L(diag(MPD )), Bern(Q)⊗n


 
 
= dTV ⊗kt=1 L z1t + z2t + z4t , Bin(n/k, Q)


Q2 N 2
 
≤ 4k · exp −
48P kn

by Lemma A.1. Since MPD ∼ A(G(N, q)), this completes the proof of the lemma.

A.3 Proofs for Symmetric 3-ary Rejection Kernels


In this section, we establish the approximate Markov transition properties for symmetric 3-ary rejection
kernels introduced in Section 7.3.

Proof of Lemma 7.7. Define L1 , L2 : X → R to be


dP+ dP− dP+ dP−
L1 (x) = (x) − (x) and L2 (x) = (x) + (x) − 2
dQ dQ dQ dQ

166
Note that if x ∈ S, then the triangle inequality implies that
 
1 a 1
PA (x, 1) ≤ 1+ · |L2 (x)| + · |L1 (x)| ≤ 1
2 4|µ2 | 4|µ1 |
 
1 a 1
PA (x, 1) ≥ 1− · |L2 (x)| − · |L1 (x)| ≥ 0
2 4|µ2 | 4|µ1 |

Similar computations show that 0 ≤ PA (x, 0) ≤ 1 and 0 ≤ PA (x, −1) ≤ 1, implying that each of
these probabilities is well-defined. Now let R1 = PX∼P+ [X ∈ S], R0 = PX∼Q [X ∈ S] and R−1 =
PX∼P− [X ∈ S] where R1 , R0 , R−1 ≥ 1 − δ by assumption.
We now define several useful events. For the sake of analysis, consider continuing to iterate Step 2
even after z is set for the first time for a total of N iterations. Let A1i , A0i and A−1 i be the events that z
is set in the ith iteration of Step 2 when B = 1, B = 0 and B = −1, respectively. Let Bi1 = (A11 )C ∩
(A12 )C ∩ · · · ∩ (A1i−1 )C ∩ A1i be the event that z is set for the first time in the ith iteration of Step 2. Let
C 1 = A11 ∪ A12 ∪ · · · ∪ A1N be the event that z is set in some iteration of Step 2. Define Bi0 , C 0 , Bi−1 and
C −1 analogously. Let z0 be the initialization of z in Step 1.
Now let Z1 ∼ D1 = L(3- SRK(1)), Z0 ∼ D0 = L(3- SRK(0)) and Z−1 ∼ D−1 = L(3- SRK(−1)). Note
that L(Zt |Bit ) = L(Zt |Ati ) for each t ∈ {−1, 0, 1} since Ati is independent of At1 , At2 , . . . , Ati−1 and the
sample z 0 chosen in the ith iteration of Step 2. The independence between Steps 2.1 and 2.3 implies that
   
 1 1 a 1
P Ai = Ex∼Q 1+ · L2 (x) + · L1 (x) · 1S (x)
2 4µ2 4µ1
 
1 a 1 1 δ a −1 1 −1
= R0 + (R1 + R−1 − 2R0 ) + (R1 − R−1 ) ≥ − 1 + |µ2 | + |µ1 |
2 8µ2 8µ1 2 2 2 4
   
1 1−a
P A0i = Ex∼Q
 
1− · L2 (x) · 1S (x)
2 4µ2
 
1 1−a 1 δ 1−a
= R0 − (R1 + R−1 − 2R0 ) ≥ − 1+ · |µ2 |−1
2 8µ2 2 2 4
   
 −1  1 a 1
P Ai = Ex∼Q 1+ · L2 (x) − · L1 (x) · 1S (x)
2 4µ2 4µ1
 
1 a 1 1 δ a 1
= R0 + (R1 + R−1 − 2R0 ) − (R1 − R−1 ) ≥ − 1 + |µ2 |−1 + |µ1 |−1
2 8µ2 4µ1 2 2 2 4

The independence of the Ati for each t ∈ {−1, 0, 1} implies that


N   N
 t Y 1 δ 1 −1 −1
1 − P Ati ≤
 
1−P C = + 1 + |µ2 | + |µ1 |
2 2 2
i=1

Note that L(Zt |Ati ) are each absolutely continuous with respect to Q or each t ∈ {−1, 0, 1}, with Radon-
Nikodym derivatives given by

dL(Z1 |Bi1 ) dL(Z1 |A1i )


 
1 a 1
(x) = (x) = 1+ · L2 (x) + · L1 (x) · 1S (x)
2 · P A1i
 
dQ dQ 4µ2 4µ1
dL(Z0 |Bi0 ) dL(Z0 |A0i )
 
1 1−a
(x) = (x) = 1− · L2 (x) · 1S (x)
2 · P A1i
 
dQ dQ 4µ2
dL(Z−1 |Bi−1 ) dL(Z−1 |A−1
 
i ) 1 a 1
(x) = (x) =   1+ · L2 (x) − · L1 (x) · 1S (x)
dQ dQ 2 · P A1i 4µ2 4µ1

167
Fix one of t ∈ {−1, 0, 1} and note that since the conditional laws L(Zt |Bit ) are all identical, we have that

dDt   dL(Zt |B1t )


(x) = P C t · (x) + 1 − P C t · 1z0 (x)
 
dQ dQ
Therefore it follows that
t)
 
1 dDt dL(Z t |B
dTV Dt , L(Zt |B1t ) 1

= · Ex∼Q (x) − (x)
2 dQ dQ
dL(Zt |B1t )
 
1  t 
(x) = 1 − P C t
 
≤ 1 − P C · Ex∼Q 1z0 (x) +
2 dQ
a 1
by the triangle inequality. Since 1 + 4µ2 · L2 (x) + 4µ1 · L1 (x) ≥ 0 for x ∈ S, we have that

dL(Z1 |B11 )
   
a 1
Ex∼Q (x) − 1 + · L2 (x) + · L1 (x)
dQ 4µ2 4µ1
  
1 a 1
=
  − 1 · Ex∼Q∗n

1+ · L2 (x) + · L1 (x) · 1S (x)
2 · P A1i 4µ2 4µ1
 
a 1
+ Ex∼Q 1 + · L2 (x) + · L1 (x) · 1S C (x)
4µ2 4|µ1 |
   
1 1
a dP+ dP−
≤ − P[Ai ] + Ex∼Q 1 +
· (x) + (x) + 2 · 1S C (x)
2 4|µ2 | dQ dQ
   
1 dP+ dP−
+ Ex∼Q · (x) + (x) · 1S C (x)
4|µ1 | dQ dQ
     
δ a −1 1 −1 −1 1 −1 3 5 −1 5 −1
≤ 1 + |µ2 | + |µ1 | + δ 1 + a|µ2 | + |µ1 | =δ + |µ2 | + |µ1 |
2 2 4 2 2 4 8
By analogous computations, we have that
dL(Z0 |B10 )
   
1−a
· L2 (x) ≤ 2δ 1 + |µ1 |−1 + |µ2 |−1

Ex∼Q (x) − 1 −
dQ 4µ2
 −1
dL(Z−1 |B1 )
  
a 1
· L1 (x) ≤ 2δ 1 + |µ1 |−1 + |µ2 |−1

Ex∼Q (x) − 1 + · L2 (x) −
dQ 4µ2 4µ1
Now observe that
     
dP+ 1−a a 1 1−a
(x) = + µ1 + µ2 · 1 + · L2 (x) + · L1 (x) + (a − 2µ2 ) · 1 − · L2 (x)
dQ 2 4µ2 4µ1 4µ2
   
1−a a 1
+ − µ1 + µ2 · 1 + · L2 (x) − · L1 (x)
2 4µ2 4µ1
   
1−a a 1 1−a
1= · 1+ · L2 (x) + · L1 (x) + a · 1 − · L2 (x)
2 4µ2 4µ1 4µ2
 
1−a a 1
+ · 1+ · L2 (x) − · L1 (x)
2 4µ2 4µ1
     
dP− 1−a a 1 1−a
(x) = − µ1 + µ2 · 1 + · L2 (x) + · L1 (x) + (a − 2µ2 ) · 1 − · L2 (x)
dQ 2 4µ2 4µ1 4µ2
   
1−a a 1
+ + µ1 + µ2 · 1 + · L2 (x) − · L1 (x)
2 4µ2 4µ1

168
Let D∗ be the mixture of L(Z1 |B11 ), L(Z0 |B10 ) and L(Z−1 |B1−1 ) with weights 1−a
2 + µ1 + µ2 , a − 2µ2 and
1−a
2 − µ1 + µ2 , respectively. It then follows by the triangle inequality that

dTV (3- SRK(Tern(a, µ1 , µ2 )), P+ )


≤ dTV (D∗ , P+ ) + dTV (D∗ , 3- SRK(Tern(a, µ1 , µ2 )))
dL(Z1 |B11 )
     
1−a a 1
≤ + µ1 + µ2 · Ex∼Q (x) − 1 + · L2 (x) + · L1 (x)
2 dQ 4µ2 4µ1
dL(Z0 |B10 )
   
1−a
+ (a − 2µ2 ) · Ex∼Q (x) − 1 − · L2 (x)
dQ 4µ2
dL(Z−1 |B1−1 )
     
1−a a 1
+ − µ1 + µ2 · Ex∼Q (x) − 1 + · L2 (x) − · L1 (x)
2 dQ 4µ2 4µ1
 
1−a
+ µ1 + µ2 · dTV D1 , L(Z1 |B11 ) + (a − 2µ2 ) · dTV D1 , L(Z0 |B10 )
 
+
2
 
1−a
− µ1 + µ2 · dTV D−1 , L(Z−1 |B1−1 )

+
2
 N
 
−1 −1
 1 −1 −1
≤ 2δ 1 + |µ1 | + |µ2 | + + δ 1 + |µ1 | + |µ2 |
2

A symmetric argument shows analogous upper bounds on both dTV (3- SRK(Tern(a, −µ1 , µ2 )), P− ) and
dTV (3- SRK(Tern(a, 0, 0)), Q), completing the proof of the lemma.

A.4 Proofs for Label Generation


In this section, we give the two deferred proofs from Section 10.2.

Proof of Lemma 10.10. This lemma follows from a similar argument to Lemma 10.9. As in Lemma 10.9,
the given conditions on C, γ, µ0 and N imply that
2
γ · y0

2 ≤1
µ0 (1 + γ 2 )

and thus X 0 is well-defined almost surely. First observe that if Z = µ00 · u + G0 where G0 ∼ N (0, Id ) then
s 2
0 0 γ · y0

0 aγ · y γ · y 0 1 1
X = 2
·u+ 0 2
·G + √ · 1−2 0 2
·G+ √ ·W
1+γ µ (1 + γ ) 2 µ (1 + γ ) 2

where a = µ00 /µ0 . Thus by the same argument as in Lemma 10.9, we have that

γ2
 
aγ · y
L(X 0 |y 0 ) = N · u, I d − · uu>
1 + γ2 1 + γ2

Now note that by the conditioning property of multivariate Gaussians, we have that

L(X|y) = N ΣXy Σ−1 −1



yy · y, ΣXX − ΣXy Σyy ΣyX

It is easily verified that

aγ γ2
ΣXy Σ−1
yy = · u and ΣXX − ΣXy Σ−1
yy ΣyX = Id − · uu>
1 + γ2 1 + γ2

169
and thus L(X|y) and L(X 0 |y 0 ) are equidistributed. Since y ∼ N (0, 1 + γ 2 ), it follows by the same appli-
cation of the conditioning property in Fact 6.2 as in Lemma 10.9 implies that
 2

dTV L(X, y), L(X 0 , y 0 ) ≤ dTV L(y), L(y 0 ) = O N −C /2
 

which completes the proof of the lemma.

Proof of Lemma 10.11. This lemma follows from a similar argument to Lemma 10.9. As in Lemmas 10.9
and 10.10, the given conditions imply that X 0 is well-defined almost surely. Conditioned on y 0 , it holds
that Z, G and W are independent. Therefore the three terms in the definition of X 0 are independent and
distributed as
2 !
γ · y0 γ · y0

· Z ∼ N 0, · Id ,
µ0 (1 + γ 2 ) µ0 (1 + γ 2 )
s 2 2 !
γ · y0 γ · y0
 
1 1
√ · 1−2 · G ∼ N 0, · Id − · Id and
2 µ0 (1 + γ 2 ) 2 µ0 (1 + γ 2 )
 
1 1
√ · W ∼ N 0, · Id
2 2

conditioned on y 0 . It follows that X 0 |y 0 ∼ N (0, Id ) and thus X 0 is independent of y 0 . Now let X ∈ Rd and
y ∈ R be such that X ∼ N (0, Id ) and y ∼ N (0, 1 + γ 2 ) are independent. The same application of the
conditioning property in Fact 6.2 as in Lemmas 10.9 and 10.10 now completes the proof of the lemma.

B Deferred Proofs from Part III


B.1 Proofs from Secret Leakage and the PCρ Conjecture
In this section, we present the deferred proof of Lemma 12.15 from Section 12. The proof of this lemma is
similar to the proof of Lemma 5.2 in [FGR+ 13].

Proof of Lemma 12.15. TheP proof is


almost identical to Lemma 5.2 in [FGR+ 13] and we give a sketch here.
|S∩T | k 2 /n2 . If the only constraint on A is its
P
Lemma 12.14 implies that T ∈A hD bS, Db T iD ≤
T ∈A 2
cardinality, then the maximum value for the RHS is obtained by adding S to A, next {T : |T ∩ S| = k − 1},
and so forth with decreasing size of |T ∩ S|, and we assume that A is defined in this manner. Letting
Tλ = {T : |T ∩ S| = λ}|, set λ0 = min{λ : Tλ 6= ∅} so that Tλ ⊆ A for λ > λ0 . We bound the ratio
k n k−j
 
|Tj | j k jn |T0 | |S|
= ≥ = jn2δ hence |Tj | ≤ ≤ .
|Tj+1 | k n k−j−1 k2 (j − 1)!n 2δj (j − 1)!n2δj
 
j+1 k

Now X X 1
|A| ≤ |Tj | ≤ |S|n−2δλ0 ≤ 2|S|n−2δλ0
j≥λ0 j≥λ0
(j − 1)!n2δ(j−λ0 )

for n greater
P than|S∩T
some constant. Thus if |A| ≥ 2|S|/n2`δ , we
Pmust conclude that ` ≥ λ0 . We bound the
≤ j=λ0 2j |Tj ∩ A| ≤ 2λ0 |Tλ0 ∩ A| + kj=λ0 +1 2j |Tj | ≤ 2λ0 |A| + 2λ0 +2 |Tλ0 +1 | ≤
|
Pk
quantity T ∈A 2
2λ0 +3 |A| ≤ 2`+3 |A|. Here we used that |Tj+1 | ≤ |Tj |n−2δ to bound by a geometric series and also
that Tλ0 +1 ⊆ A. Rearranging and combining with the inequality at the start of the proof concludes the
argument.

170
B.2 Proofs for Reductions and Computational Lower Bounds
In this section, we present a number of deferred proofs from Part III. The majority of these proofs are similar
to other proofs presented in the main body of the paper.

Proof of Theorem 3.7. To prove this theorem, we will to show that Theorem 10.8 implies that k- BPDS - TO - MSLR
applied with r > 2 fills out all of the possible growth rates specified by the computational lower bound
n = õ(k 2 2 /τ 4 ) and the other conditions in the theorem statement. As discussed above, it suffices to reduce
in total variation to MSLR(n, k, d, τ, 1/r) where 1/r ≤ .
Fix a constant pair of probabilities 0 < q < p ≤ 1 and any sequence of parameters (n, k, d, τ, ) all of
which are implicitly functions of n such that (n, −1 ) satisfies ( T ) and (n, k, d, τ, ) satisfy the conditions
k 2 2
n≤c· , wk ≤ n1/6 and wk 2 ≤ d
w2 · τ 4 · (log n)4+2c0
for sufficiently large n, an arbitrarily slow-growing function w = w(n) → ∞ at least satisfying that
w(n) = no(1) , a sufficiently small constant c > 0 and a sufficiently large constant c0 > 0. The rest of
this proof will follow that of Theorem 3.1 very closely. In order to fulfill the criteria in Condition 6.1, we
specify M, N, kM , kN and n0 exactly as in Theorem 3.1. As in Theorem 3.1, we have the inequalities
 2t
k 2 2

0 −2 2t r
n ≤w r =O ·
n τ 4 · (log n)2+2c0
1/2
!
c1/4 1/2 k 1/2 rt/2 kM
τ ≤ 1/4 =Θ ·p
n (log n)(2+c0 )/2 n1/4 rt+1 (log n)2+c0
Furthermore, we also have that
c1/2 · k rt
 
2 kN kM
τ ≤ =O ·
wn1/2 · (log n)2+c0 n N log(M N )

As long as n = Θ̃(rt ) then: (2.1) the inequality above on n0 would imply that (n0 , k, d, τ, ) is in the
desired hard regime; (2.2) n and n0 have the same growth rate since w = no(1) ; and (2.3) n  M 3 , d ≥ M
and taking c0 large enough would imply that τ satisfies the bounds needed to apply Theorem 10.8 to yield
the desired reduction. By Lemma 13.2, there is an infinite subsequence of the input parameters such that

n = Θ̃(rt ), which concludes the proof as in Theorem 3.1.

Proof of Lemma 14.8. First suppose that M ∼ GHPMD (n, r, C, D, γ) where C and D are each sequences
of r disjoint sets of size K. Since the Mij are independent for 1 ≤ i, j ≤ n, we now have that
n
X rK 2
E Mij2 − 1 = rK 2 · γ 2 + · γ2
 
E[sC (M )] =
r−1
i,j=1
n
X rK 2
Var Mij2 − 1 = rK 2 · 4γ 2 + · γ 2 + 2n2
 
Var [sC (M )] =
(r − 1)3
i,j=1

Here, we have used the following facts. If X ∼ N (0, 1), then


" 2 #
γ γ2
E[(γ + X)2 − 1] = γ 2 , E +X −1 =
r−1 (r − 1)2
" 2 #
γ γ2
Var[X 2 − 1] = 2, Var[(γ + X)2 − 1] = 4γ 2 + 2, Var +X −1 = +2
r−1 (r − 1)4

171
Note that sC (M ) is invariant to permuting the rows and columns of M and thus sC (M ) is equidistirbuted
under M ∼ GHPMD (n, r, C, D, γ) and M ∼ GHPMD (n, r, K, γ). Now Chebyshev’s inequality implies the
desired lower bound on sC (M ) in (1) holds with probability 1 − on (1). Now observe that
r X X
X
sI (M ) ≥ Mij = Y
h=1 i∈Ch j∈Dh

holds almost surely by definition when M ∼ GHPMD (n, r, C, D, γ). Note that Y ∼ N (rK 2 γ, rK 2 )
conditioned on C and D and therefore it holds that Y ≥ rK 2 γ − wr1/2 K with probability 1 − on (1). The
second lower bound in (1) now follows since sI (M ) is equidistirbuted under M ∼ GHPMD (n, r, C, D, γ)
and M ∼ GHPMD (n, r, K, γ).
Now suppose that M ∼ N (0, 1)⊗n×n . In this case, sC (M ) + n2 is distributed as χ2 (n2 ) and the first
upper bound in (2) holds by Chebyshev’s inequality and the fact that χ2 (n2 ) has variance 2n2 . Now note
r X X
X
Y (C, D) = Mij ∼ N (0, rK 2 )
h=1 i∈Ch j∈Dh

Standard gaussian tail bounds imply that


 2 
h
3/2
p i 1 1  3/2
p
P Y (C, D) > 2rK w (log n + log r) ≤ √ exp − 2rK w (log n + log r)
2π 2rK 2
2
≤ (nr)−2rKw

A crude upper bound on the number of pairs (C, D) is


  2
n
rrK = o (nr)2rK

rK
p
and therefore a union bound implies that sI (M ) = maxC,D Y (C, D) ≤ 2rK 3/2 w (log n + log r) with
probability 1 − on (1). This completes the proof of the lemma.

Proof of Corollary 14.11. Consider the following reduction A that adds a simple post-processing step to
k- PDS - TO - GHPM as in Corollary 14.5. On input graph G with N vertices:

1. Form the graph MR by applying k- PDS - TO - GHPM to G with parameters N, r, k, E, `, n, s and µ where
µ is given by √
rt r
 
−1 1 1 −1
µ= ·Φ + · min{P0 , 1 − P0 } · γ
(r − 1) 2 2
and Φ−1 is the inverse of the standard normal CDF.

2. Let G1 be the graph where each edge (i, j) with is in G1 if and only if (MR )ij ≥ 0. Now form G2 as
in Step 2 of Corollary 14.5, while restricting to edges between the two parts.

This clearly runs in poly(N ) time and it suffices to establish its approximate Markov transition properties.
Let A1 denote the first step with input G and output MR , and let A2 denote the second step with input MR
and output G2 . Let C and D be two fixed sequences, each consisting of r disjoint subsets of [ksrt ] of size
krt−1 . Let P1 , P2 ∈ (0, 1) be
   
µ(r − 1) µ
P1 = Φ √ and P2 = Φ − t √
rt r r r

172
Note that by the definition of µ, we have that P1 = 21 + 12 ·min{P0 , 1−P0 }−1 ·γ. Now note that A2 applied to
MR ∼ GHPMD (ksrt , r, C, D, γ) yields an instance of BHPMD (ksrt , r, C, D, γ) with the following modified
edge probabilities:

1. The edge probabilities between vertices Ch and Dh for each 1 ≤ h ≤ r are still P0 + γ.

2. The edge probabilities between Ch1 and Dh2 for each h1 6= h2 are now
     
µ 1 1
P0 + 2 min{P0 , 1 − P0 } · Φ − t √ − = P0 + 2 min{P0 , 1 − P0 } · P2 −
r r 2 2

3. All other edge probabilities are still P0 .

We now apply a similar sequence of inequalities as in Corollary 14.5. For now assume that P0 ≤ 1/2. Using
the fact that all of the edge indicators of this model and the usual definition of BHPM are independent, the
tensorization property in Fact 6.2 and Lemma 6.5, we now have that

dTV A2 GHPMD (ksrt , r, C, D, γ) , BHPMD (ksrt , r, C, D, γ)


 
 ⊗k2 r2t−1 (r−1)   ⊗k2 r2t−1 (r−1) !
γ 1
≤ dTV Bern P0 − , Bern P0 + 2P0 · P2 −
r−1 2
 v
k 2 r2t−1 (r − 1)

γ 1 u
≤ + 2P0 · P2 − ·
u   
r−1 2
t
γ γ
2 P0 − r−1 1 − P0 + r−1
 
γ 1
· O krt

≤ + 2P0 · P2 −
r−1 2

where the third inequality uses the fact that P0 is bounded away from 0 and 1 and γ = o(1). Now note that
   
γ 2P0 µ(r − 1) 1
= · Φ √ −
r−1 r−1 rt r 2

Using the standard Taylor approximation for Φ(x) − 1/2 around zero when x ∈ (−1, 1), we have
         
γ 1 1 µ(r − 1) 1 µ 1
+ 2P 0 · P 2 − = 2P ·
0
Φ √ − − Φ − √ −
r − 1 2 r−1 rt r 2 rt r 2
 3√ 
µ r
=O
r3t

Therefore we have that


√ 
kµ3 r

t t
 
dTV A2 GHPM D (ksr , r, C, D, γ) , BHPM D (ksr , r, C, D, γ) = O
r2t

A nearly identical argument considering the complement of the graph G1 and replacing with P0 with 1 − P0
establishes this bound in the case when P0 > 1/2. Observe that A2 (N (0, 1)⊗n×n ) ∼ GB (n, n, P0 ). Now
consider applying Lemma 6.3 to the steps A1 and A2 as in Corollary 14.5. It can be verified that the given
bound on γ yields the condition on µ needed to apply Theorem 14.9 if c > 0 is sufficiently small. Thus 1
is bounded by Theorem 14.9 and 2 is bounded by the argument above after averaging over C and D and
applying the conditioning property of Fact 6.2. This application of Lemma 6.3 therefore yields the desired
two approximate Markov transition properties and completes the proof of the corollary.

173
Proof of Theorem 3.4. As discussed in the beginning of this section, it suffices to map to G(n, P0 − µ1 )
under H0 and TSI(n, k, k1 , P0 , µ1 , µ2 , µ3 ) under H1 where µ3 = P1 − P0 and µ1 , µ2 ≥ 0. Thus it suffices
to show that the reduction A in Corollary 14.15 fills out all of the possible growth rates specified by the
−P0 )2
computational lower bound P(P01(1−P 0)
= õ(n/k 2 ) and the other conditions in the theorem statement. Fix a
constant pair of probabilities 0 < q < p ≤ 1 and any sequence of parameters (n, k, P1 , P0 ) all of which are
implicitly functions of n such that
(P1 − P0 )2 n
≤c· 3 2 and min{P0 , 1 − P0 } = Ωn (1)
P0 (1 − P0 ) w · k log n
for sufficiently large n, sufficiently small constant c > 0 and an arbitrarily slow-growing increasing positive
integer-valued function w = w(n) → ∞ at least satisfying that w(n) = no(1) . As in the proof of Theorem
3.1, it suffices to specify:
1. a sequence (N, kN ) such that the k- PDS(N, kN , p, q) is hard according to Conjecture 2.3; and
2. a sequence (n0 , k 0 , P1 , P0 , s, t, µ) satisfying: (2.1) the parameters (n0 , k 0 , P1 , P0 ) are in the regime
of the desired computational lower bound for SEMI - CR; (2.2) (n0 , k 0 ) have the same growth rates as
(n, k); and (2.3) such that G(n0 , P0 − µ1 ) and TSI(n0 , k 0 , k 0 /2, P0 , µ1 , µ2 , P1 − P0 ), where k 0 is even
and µ1 , µ2 ≥ 0, can be produced by A with input k- PDS(N, kN , p, q).
We choose these parameters as follows:

• let t be such that 3t is the smallest power of 3 larger than k/ n and let s = d2n/3ke;
• let µ ∈ (0, 1) be given by
 
t −1 1 1 −1
µ=3 ·Φ + · min{P0 , 1 − P0 } (P1 − P0 )
2 2

• now let $  %
p −1 −2 √

1
kN = 1+ w · n
2 Q
p √ 
where Q = 1 − (1 − p)(1 − q) + 1{p=1} q − 1 ; and
t
• let n0 = 3kN s · 3 2−1 , let k 0 = (3t − 1)kN and let N = wkN
2 .

Note that 3t = Θ(k/ n), s = Θ(n/k) and 3t kN s ≤ poly(N ). Note that this choice of µ implies that
   
µ 1
P1 = P0 + 2 min{P0 , 1 − P0 } · Φ t −
3 2
which implies that the instance of TSI output by A has edge density P1 on its k 0 -vertex the planted dense
subgraph. It follows that
k n √
n0  3t kN s  √ · w−2 · n  w−2 · n and k 0  3t kN  w−2 k
n k
(P1 − P0 )2 n n0
≤c· 3 2 .c·
P0 (1 − P0 ) w · k log n w · (k 0 )2 log n0

 
p
m≤2 + 1 wkN 2
≤ w−1 n · kN ≤ 3t kN s
Q

t t n c
µ . 3 · (P1 − P0 ) . 3 · 3/2 √ 0
≤ 3/2 √
w · k log n w log n0

174
where the last bound 
above follows
 from the fact that Φ(x) − 1/2 ∼ x if |x| → 0. Here, m is the smallest
p
multiple of kN larger Q + 1 N . Now note that: (2.1) the third inequality above on (P1 −P0 )2 /P0 (1−P0 )
implies that (n0 , k 0 , P1 , P0 ) is in the desired hard regime; (2.2) (n, n0 ) and (k, k 0 ) have the same growth rates
since w = no(1) ; and (2.3) the last two bounds above imply that taking c small enough yields the conditions
needed to apply Corollary 14.15 to yield the desired reduction. This completes the proof of the theorem.

Proof of Lemma 16.1. The parameters a, µ1 , µ2 for which these distributional statements are true are given
by

a = Φ(τ ) − Φ(−τ )
1 1
µ1 = ((1 − Φ(τ − µ)) − Φ(−τ − µ)) = (Φ(τ + µ) − Φ(τ − µ))
2 2
1 1 1
µ2 = (Φ(τ ) − Φ(−τ )) − (Φ(τ + µ) − Φ(−τ + µ)) = (2 · Φ(τ ) − Φ(τ + µ) − Φ(τ − µ))
2 2 2
Now note that Z τ +µ
1 1 2 /2
µ1 = (Φ(τ + µ) − Φ(τ − µ)) = √ e−t dt = Θ(µ)
2 2 2π τ −µ
2 /2
and is positive since e−tis bounded on [τ − µ, τ + µ] as τ is constant and µ → 0. Furthermore, note that
Z τ Z τ +µ
1 1 −t2 /2 1 2
µ2 = (2 · Φ(τ ) − Φ(τ + µ) − Φ(τ − µ)) = √ e dt − √ e−t /2 dt
2 2 2π τ −µ 2 2π τ
Z τ +µ  Z τ +µ
1 2 2
 1 2
 2

= √ e−(t−µ) /2 − e−t /2 dt = √ e−t /2 etµ−µ /2 − 1 dt
2 2π τ 2 2π τ
2 /2
Now note that as µ → 0 and for t ∈ [τ, τ + µ], it follows that 0 < etµ−µ − 1 = Θ(µ). This implies that
0 < µ2 = Θ(µ2 ), as claimed.

Proof of Theorem 3.10. To prove this theorem, we will to show Theorem 16.2 implies that k- BPDS -TO - GLSM
fills out all of the possible growth rates specified by the computational lower bound n = õ τU−4 and the
other conditions in the theorem statement, as in the proof of Theorems 3.1 and 13.4. Fix a constant pair of
probabilities 0 < q < p ≤ 1 and any sequence (n, k, d, U) where U = (D, Q, {Pν }ν∈R ) ∈ UC(n) all of
which are implicitly functions of n with
c
n≤ and wk 2 ≤ d
τU4 · w2 · (log n)2
for sufficiently large n, an arbitrarily slow-growing function w = w(n) → ∞ and a sufficiently small
constant c > 0. Now consider specifying the parameters M, N, kM , kN and t exactly as in Theorem 13.4.
Now note that under these parameter settings, we have that
s
c1/4 kN
τU ≤ 1/4 1/2 √ ≤ 2c1/4 ·
n w log n N log N

Therefore τU satisfies the conditions needed to apply Theorem 16.2 for a sufficiently small c > 0. The
other parameters (n, k, d, U) and (M, N, kM , kN , p, q) can also be verified to satisfy the conditions of this
theorem. We now have that k- BPDS(M, N, kM , kN , p, q) is hard according to Conjecture 2.3, and that
GLSM (n, k, d, U) can be produced by the reduction k- BPDS - TO - GLSM applied to BPDS (M, N, kM , kN , p, q).
This verifies the criteria in Condition 6.1 and, following the argument in Section 6.2, Lemma 6.1 now implies
the theorem.

175

You might also like