Leverage Scores

Let $A \in \mathbb R^{m \times n}$ be a data matrix, and let $a_i \in \mathbb R^n$ be the $i$th row of $A$.

Note that if the columns of $U$ form an orthonormal basis for the column space of $A$, then the $i$th leverage score is the squared L2 norm of the $i$th row of $U$:

$$\tau_i = \|u_i\|_2^2,$$

where $u_i^T$ denotes the $i$th row of $U$. This can be easily seen as follows. One characterization of the leverage score is as the minimum-norm way of writing $a_i$ as a combination of the rows of $A$:

$$\tau_i = \min \{ \|x\|_2^2 : x^T A = a_i^T \}.$$

Compute the (thin) SVD of $A$ as $A = U S V^T$. Then the minimizer is $x^T = a_i^T A^+ = a_i^T V S^{-1} U^T$, so

$$\tau_i = \|a_i^T V S^{-1} U^T\|_2^2 = \|a_i^T V S^{-1}\|_2^2,$$

where the second equality holds because $U$ has orthonormal columns. Because $a_i^T V = u_i^T S$ (just multiply the $i$th row of $A = U S V^T$ by $V$), this simplifies to $\tau_i = \|u_i^T S S^{-1}\|_2^2 = \|u_i\|_2^2$. Equivalently, assuming $A$ has full column rank, $\tau_i = a_i^T (A^T A)^{-1} a_i$, the $i$th diagonal entry of the hat matrix $A (A^T A)^{-1} A^T = U U^T$.
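As a quick sanity check (my addition, not from the original derivation), the following MATLAB snippet compares the two expressions for $\tau_i$ on a small random matrix:

% Sanity check: the SVD-based and hat-matrix expressions for the
% leverage scores should agree up to floating point error.
A = randn(20, 5);
[U, ~, ~] = svd(A, 0);             % economy SVD: U has orthonormal columns
taus_svd = sum(U.^2, 2);           % tau_i = ||u_i||_2^2
taus_hat = diag(A / (A'*A) * A');  % tau_i = a_i' (A'A)^{-1} a_i
max(abs(taus_svd - taus_hat))      % on the order of machine precision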

Experiments

Unequal sized balls

I wanted a simple example to understand how these scores behave, so I generated 10 well separated Gaussian balls of sizes 100, 200, 300, …, 1000. In this case, I would like the sampling probability (i.e. the leverage score) to decrease as the size of the ball increases. In other words, there should be a high probability of sampling from the small balls and a low probability of sampling from the large balls.

When these balls are in 10 dimensions, this is exactly what we observe. The following MATLAB code generates the data and computes the leverage scores:

rng(3)
% Make some well separated Gaussian balls. 
dim = 10;
num_of_balls = 10;
sigma_0 = .0002;
X = [];
for i = 1:num_of_balls
    X = [X; mvnrnd(10*rand(1,dim), diag(sigma_0*ones(dim,1)), i*100)];
end

% Compute Leverage Scores
[U, S, V] = svd(X, 0);   % economy-size SVD
taus = sum(U.^2, 2);     % tau_i = ||u_i||_2^2

figure(1);
plot(taus);
ylabel('Leverage Score')

[Figure: leverage scores, dim = 10]

So, within each ball the leverage scores are very similar, and bigger balls have smaller leverage scores. This seems like a great way to sample.
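For concreteness, here is what sampling according to these scores would look like (this step is my addition, not part of the original experiment; randsample is from the Statistics Toolbox):

% Sketch: draw k rows with probability proportional to their leverage
% scores, with replacement.
k = 50;
p = taus / sum(taus);                        % normalize scores into a distribution
idx = randsample(size(X, 1), k, true, p);    % sample k row indices
sample = X(idx, :);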

However, when I increase the dimension (i.e. set dim=100 in the code), things don’t look as nice:

[Figure: leverage scores, dim = 100]

I suppose this is just because every ball has outliers, and the leverage scores are large on those outliers; in some sense, every point is an outlier in high dimensions. That being said, if my goal is to sample from the small clusters with higher probability than from the large clusters, it does not seem like I would want to sample according to this score.
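One way to probe this explanation (a sketch I'm adding; it reuses taus and num_of_balls from the code above) is to summarize the scores within each ball:

% Per-ball summary: mean and max leverage score in each ball, using the
% ball sizes 100, 200, ..., 1000 from the generation loop.
sizes = (1:num_of_balls) * 100;
stops = cumsum(sizes);
starts = [1, stops(1:end-1) + 1];
for i = 1:num_of_balls
    t = taus(starts(i):stops(i));
    fprintf('ball %2d: mean tau = %.4f, max tau = %.4f\n', i, mean(t), max(t));
end

If the outlier explanation is right, the within-ball max should be much larger than the within-ball mean when dim = 100.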