General low-rank sparse matrix

I generate a low-rank matrix with a large number of zeros, then set random elements to zero giving the measured martrix. Then, we find the low rank approximation, and plot the distribution of the elements corresponding to the original zeros. Below, I do it for ten replicates, but plot only the first four.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Simulate  Low Rank Expression Matrix (i.e. sparse, positive valued, low-rank)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clc; clear; close all
rng(1)

%Perhaps there are better ways to make a sparse, low rank matrix.  I did it by
%setting its factors to be sparse.  Then removing rows/columns that are all
%zeros, and repeating until the matrix is large enough.
ni = 20
true_zeros_means = zeros(ni,1);
for (i=1:ni)
m = 1E3; n = 1E3; r=10;  C= 3;
factor_sparsity = 0.7;
M_original = zeros(1,1);
while(size(M_original,2) < 0.75*m ||size(M_original,1) < 0.75*m)
    M1 = C*rand( m, r);
    M2 = C*rand(n,r);

    M1(randperm(numel(M1), round(m*r*factor_sparsity))) = 0;
    M2(randperm(numel(M2), round(n*r*factor_sparsity))) = 0;
    M_original_pre = M1* M2';
    nonempty_rows = m> sum(M_original_pre == 0,2);
    nonempty_cols = m> sum(M_original_pre == 0,1);
    M_original = M_original_pre(nonempty_rows,nonempty_cols);
end

[m_final, n_final] = size(M_original);
[~,S,~] = svd(M_original); 

%All the zeros in M_original are true zeros. Now, let's simulated dropout, and
%perturb the matrix by setting many more elements to be zero.
missing_num = round(size(M_original,1)*size(M_original,2)*0.7);
M_missing = M_original;
M_missing(randperm(numel(M_original), missing_num)) = 0; 

%"Complete" the matrix using SVD
[U S V] = svd(M_missing);
M_approx = U(:,1:r)*S(1:r,1:r)*V(:,1:r)';

true_zeros_means(i) = mean(M_approx(M_original==0));
    if ( i < 5) 
        figure(1); subplot(2,2,i);
        histogram(M_approx(M_original==0 )); 
        title(sprintf('%d x %d matrix, from %.2f to %.2f nnz\nmean %.4f',m_final,n_final,nnz(M_original)/(m*n), nnz(M_missing)/(m*n),true_zeros_means(i)))
        set(gca,'fontsize', 12);

    end
end

It looks quite symmetric around zero. You might also note that the shape of the distribution is not a Gaussian–it is very pointy. I thought it sort of looked like a Laplace distribution, but if I try to fit a Laplace to it, it did not seem to fit.

However, it is not exactly around zero. I wrote the mean of the distribution in the title of each experiment, and you see that it is very small, and slightly positive. Indeed, if you run it 10 times, the means you get are all positive:

true_zeros_means =

    0.0236
    0.0204
    0.0258
    0.0206
    0.0215
    0.0204
    0.0205
    0.0222
    0.0193
    0.0212
    0.0185
    0.0224
    0.0173
    0.0227
    0.0219
    0.0209
    0.0241
    0.0209
    0.0196
    0.0192

So, perhaps it is only “roughly” symmetric around zero; there does appear to be some slight positive bias we had not considered before (as we were just eyeballing it).

But in reality, that is not necessarily what we care about. That the distribution is symmetric around zero would be sufficient, but not necessary, for our imputation procedure (ALRA) to make sense. The necessary and sufficient hypothesis for ALRA is that the error distribution of the true zeros at each row is upper bounded by the magnitude of the most negative element in that row. This is actually how we use ALRA: we threshold each row based on the magnitude of the smallest element. Taking one of the matrices above, let’s see how well this approach preserves the zeros.

mins = repmat(min(M_approx,[],2), 1,n_final);
thresholded  = max(M_approx - abs(mins), 0);
sum(M_original == 0 & thresholded ~=0)/sum(M_original==0)
%0.007

Indeed, it performs extremely well. More than 99% of the zeros preserved by this procedure (this holds for all 10 experiments above). Perhaps this is something easier to prove?

In any case, here is how it looks for each row:

figure(2); clf
for (nroi=1:10)
    subplot(5,2,nroi)
    roi = M_approx(nroi,:);
    histfit(roi(M_original(nroi,:) == 0))
    title(sprintf('Row %d', nroi))
end

You can see how thresholding based on the smallest number can give you <1% error

Boaz’ example

Boaz constructed a rank-3 example,

with the following code, which I modified to be a bit more random

rng(1)
figure(1); clf;

n = 2000; 
p = 1000; 

X = zeros(n,p);  % I view the rows as samples, columns as genes 
u1 = zeros(n,1); v1 = zeros(p,1); 
u2 = zeros(n,1); v2 = zeros(p,1); 
u3 = zeros(n,1); v3 = zeros(p,1); 

n1 = 500; p1=p/2;  
u1(1:n1) = rand(1,1)+ rand(n1,1); 
v1(1:p1) = rand(1,1)+ rand(p1,1); 

n2 = n - n1; p2 = p - p1; 
u2(n1+1:end) = rand(1,1)+2*rand(n2,1); 
v2(p1+1:end) = rand(1,1)+ rand(p2,1); 

u3 = 1+rand(n,1); 
idx_3 = floor(p/3):floor(2*p/3); 
v3(idx_3) = 1+rand(size(idx_3)); 

X = X + u1 * v1' + u2*v2' + u3 * v3'; 

dropout_rate = 0.5; 
W = (rand(n,p) > dropout_rate);   % Wij=1 means that entry is observed
Xt = X.*W;  % THIS IS THE OBSERVED MATRIX

[U0 S0 V0 ] = svd(X); 
[U S V ]= svd(Xt);  % matrix is U * S * V' ; so V(:,1) is the first right singular vector
S0 = diag(S0); 
S = diag(S); 

% reconstructed matrix from low rank svd, here we assume rank is known to
% be 3 
Xr = U(:,1)* S(1) * V(:,1)' + U(:,2)* S(2) * V(:,2)' + U(:,3) * S(3) * V(:,3)'; 

ERR = X - Xr; 
ZERO_idx = (X==0); 

% look at distribution of errors at a specific column c
idx = find(ZERO_idx==1); 

figure(2); clf
for (nroi=1:10)
    subplot(5,2,nroi)
    roi = Xr(nroi,:);
    histfit(roi(X(nroi,:) == 0))
    title(sprintf('Row %d', nroi))
end

When looking at the error distributions for zeros in the entire matrix (as above), we see that it is nicely symmetric…and that the mean is actually slightly less than zero.

So, using a global threshold (i.e. not row-specific) with this matrix would work just fine; it would restore the zeros.

However, when you look at the row-specific distribution of zero errors in the observed matrix, it is not symmetric. Row 6, for example, is particularly disastrous.

for (nroi=1:10)
    subplot(5,2,nroi)
    roi = Xr(nroi,:);
    histfit(roi(X(nroi,:) == 0))
    title(sprintf('Row %d', nroi))
    set(gca,'fontsize', 13);
end

So, if you try to threshold on each row, you do very poorly:

mins = repmat(min(Xr,[],2), 1,p);
Xrt  = max(Xr - abs(mins), 0);
sum(X == 0 & Xrt ~=0)/sum(X==0)
% 0.489

However, as Boaz pointed out, it would work perfectly on the columns! Here are column-specific distribution of zero errors:

Indeed, if you now threshold on each column, you do much better:

mins = repmat(min(Xr,[],1), size(Xr,1),1);
Xrt  = max(Xr - abs(mins), 0);
sum(X == 0 & Xrt ~=0)/sum(X==0)

We have to think more about why this asymmetry arises. Note that it is not simply because each row has a contribution from all three singular vectors whereas each column is only from a single one, because in my example matrix above, every row has a contribution from every singular vector and the distributions are very nice. Note also that something else different about this example is that the center columns of the true matrix have no zeros at all.

That being said, I think it will be hard to prove the row-by-row thresholding works. It might be best to first focus on global thresholding, which works fine in both of the above examples.