A New Method for Non-negative Matrix Factorization

What is Non-negative Matrix Factorization?

[Animation: ODE solution of NMF over time]

Non-negative Matrix Factorization (NMF) is a technique that decomposes a non-negative matrix X into the product of two non-negative matrices W and H. This can be expressed as:

$$X \approx WH$$

where X is an n×m matrix, W is n×k, and H is k×m. NMF is useful as a dimension-reduction technique because k is typically chosen to be much smaller than m. One way to think about NMF is that each column of X is estimated as a linear combination of the basis vectors (columns) of W:

$$\mathbf{x}_i \approx h_{1,i}\,\mathbf{w}_1 + h_{2,i}\,\mathbf{w}_2 + \cdots + h_{k,i}\,\mathbf{w}_k$$

The goal is to condense information. Instead of dealing with a large number of data columns (m total), we can instead work with a small number of basis vectors (k total).
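As a quick illustration of these shapes, here is a toy sketch in Julia (the dimensions and matrices are made up purely for illustration) showing that a column of X is a linear combination of the columns of W weighted by the corresponding column of H:

# Toy illustration of the NMF shapes: X (n×m) ≈ W (n×k) * H (k×m)
n, m, k = 6, 8, 3
W = rand(n, k)
H = rand(k, m)
X = W * H                 # an exactly factorizable X, just for illustration

# Column i of X is a linear combination of the columns of W,
# weighted by the entries in column i of H
i = 4
xi = sum(H[j, i] * W[:, j] for j in 1:k)
xi ≈ X[:, i]              # true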

Multiplicative Update Algorithm

Solving for W and H simultaneously is a non-convex optimization problem (and NP-hard); however, solving for one of the matrices while keeping the other constant is convex. Most NMF algorithms take advantage of this by switching back and forth between updating W and H until some convergence criterion is met. The first efficient algorithm for NMF performs a simple multiplicative update based on gradient descent:

$$\theta_{t+1} = \theta_t - \eta \odot \nabla D(\theta_t)$$

where θ is some parameter vector (i.e., some point in multidimensional space) being updated and D is the loss function that tells you how well the parameters fit the given model. $\nabla D$ is the gradient that points in the direction of steepest ascent (we subtract it because we want to find the minimum loss) and η is the "learning rate" which scales how large a step we take toward the minimum. Note that I'm using the $\odot$ symbol to be explicit about element-wise multiplication.
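To make the update rule concrete, here is a minimal one-dimensional sketch (a made-up toy loss, not part of the NMF derivation), minimizing $D(\theta) = (\theta - 3)^2$:

# Minimal 1-D gradient descent: minimize D(θ) = (θ - 3)², whose gradient is 2(θ - 3)
∇D(θ) = 2 * (θ - 3)

function descend(θ; η = 0.1, steps = 100)
    for _ in 1:steps
        θ = θ - η * ∇D(θ)   # θₜ₊₁ = θₜ - η ∇D(θₜ)
    end
    return θ
end

descend(0.0)   # ≈ 3.0, the minimizer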

We want to perform gradient descent to optimize both W and H:

$$W_{t+1} = W_t - \eta_W \odot \nabla_W D(W_t) \qquad H_{t+1} = H_t - \eta_H \odot \nabla_H D(H_t)$$

For NMF a reasonable loss function is:

$$D(X, W, H) = \lVert X - WH \rVert_F^2$$

where $\lVert \cdot \rVert_F$ is the Frobenius norm, which we can rewrite in terms of the trace operator:

$$
\begin{aligned}
\lVert X - WH \rVert_F^2 &= \operatorname{tr}\!\left((X - WH)^\top (X - WH)\right) && \text{Definition of trace for matrix norm} \\
&= \operatorname{tr}\!\left((X^\top - H^\top W^\top)(X - WH)\right) && \text{Transpose property} \\
&= \operatorname{tr}\!\left(X^\top X - X^\top W H - H^\top W^\top X + H^\top W^\top W H\right) && \text{Expand}
\end{aligned}
$$
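This identity is easy to check numerically; here is a quick sketch on small random matrices, just to confirm the algebra:

using LinearAlgebra

# Check ||X - WH||_F² = tr((X - WH)ᵀ(X - WH)) on small random matrices
n, m, k = 5, 7, 3
X, W, H = rand(n, m), rand(n, k), rand(k, m)

R = X - W * H
norm(R)^2 ≈ tr(R' * R)   # true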

And then comes the fun part of taking the gradient of the loss function with respect to W and H:

$$
\begin{aligned}
\nabla_W \lVert X - WH \rVert_F^2 &= \nabla_W \left(\operatorname{tr}(X^\top X) - \operatorname{tr}(X^\top W H) - \operatorname{tr}(H^\top W^\top X) + \operatorname{tr}(H^\top W^\top W H)\right) \\
&= 0 - (H X^\top)^\top - X H^\top + W\left((H H^\top)^\top + H H^\top\right) \\
&= -2 X H^\top + 2 W H H^\top
\end{aligned}
$$

$$
\begin{aligned}
\nabla_H \lVert X - WH \rVert_F^2 &= \nabla_H \left(\operatorname{tr}(X^\top X) - \operatorname{tr}(X^\top W H) - \operatorname{tr}(H^\top W^\top X) + \operatorname{tr}(H^\top W^\top W H)\right) \\
&= 0 - (X^\top W)^\top - W^\top X + \left(W^\top W + (W^\top W)^\top\right) H \\
&= -2 W^\top X + 2 W^\top W H
\end{aligned}
$$

Whew, now we can take these two results and substitute them back into our gradient descent equations for W and H:

$$W_{t+1} = W_t + \eta_W \odot \left(X H_t^\top - W_t H_t H_t^\top\right) \qquad H_{t+1} = H_t + \eta_H \odot \left(W_t^\top X - W_t^\top W_t H_t\right)$$

Note that the factor of two was "absorbed" into the learning rate (you could also add a factor of 1/2 to the loss function; it doesn't really matter). There is an issue with the equations we've written down so far: they contain negative terms! Luckily, we still have to define the learning-rate parameters, which we can set to anything we like. Lee and Seung proposed defining the learning rates as:

$$\eta_W = \frac{W}{W H H^\top} \qquad \eta_H = \frac{H}{W^\top W H}$$

which, when substituted into the descent equations, gives some nice cancellation:

$$
\begin{aligned}
W_{t+1} &= W_t + \left(\frac{W_t}{W_t H_t H_t^\top}\right) \odot \left(X H_t^\top - W_t H_t H_t^\top\right) \\
&= W_t + W_t \odot \frac{X H_t^\top}{W_t H_t H_t^\top} - W_t \odot \frac{W_t H_t H_t^\top}{W_t H_t H_t^\top} \\
&= W_t \odot \frac{X H_t^\top}{W_t H_t H_t^\top}
\end{aligned}
$$

$$
\begin{aligned}
H_{t+1} &= H_t + \left(\frac{H_t}{W_t^\top W_t H_t}\right) \odot \left(W_t^\top X - W_t^\top W_t H_t\right) \\
&= H_t + H_t \odot \frac{W_t^\top X}{W_t^\top W_t H_t} - H_t \odot \frac{W_t^\top W_t H_t}{W_t^\top W_t H_t} \\
&= H_t \odot \frac{W_t^\top X}{W_t^\top W_t H_t}
\end{aligned}
$$

And these are the two update rules for NMF:

$$W_{t+1} = W_t \odot \frac{X H_t^\top}{W_t H_t H_t^\top} \qquad H_{t+1} = H_t \odot \frac{W_t^\top X}{W_t^\top W_t H_t}$$

Essentially, you pick random non-negative starting matrices for W and H, then use these equations to iteratively update the solution until it stops changing significantly.
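As a quick numerical sanity check of these rules (a sketch on random matrices, separate from the comparison code later in the post), a single multiplicative update keeps everything non-negative and does not increase the loss:

using LinearAlgebra

# One multiplicative-update sweep on random data
n, m, k = 20, 30, 5
X, W, H = rand(n, m), rand(n, k), rand(k, m)

before = norm(X - W * H)

W = W .* (X * H') ./ (W * H * H')
H = H .* (W' * X) ./ (W' * W * H)

all(≥(0), W) && all(≥(0), H)   # true: the updates preserve non-negativity
norm(X - W * H) ≤ before       # true: the loss is non-increasing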

Solving NMF with Ordinary Differential Equations

Let's look back at our gradient descent equation:

$$\theta_{t+1} = \theta_t - \eta \odot \nabla D(\theta_t)$$

Instead of taking discrete steps toward a local minimum, we can look at the limit as we take smaller and smaller step sizes:

$$
\begin{aligned}
\theta_{t+1} &= \theta_t - \Delta t \, \nabla D(\theta_t) && \text{Set } \eta \text{ to } \Delta t \\
\lim_{\Delta t \to 0} \frac{\theta_{t+1} - \theta_t}{\Delta t} &= -\nabla D(\theta_t) && \text{Take limit} \\
\frac{d\theta}{dt} &= -\nabla D(\theta) && \text{Definition of derivative}
\end{aligned}
$$

This approach is known as gradient flow. The multiplicative updates can be rewritten as $W_{t+1} = W_t + W_t \odot \left(\frac{X H_t^\top}{W_t H_t H_t^\top} - \mathbf{1}\right)$ (and similarly for H), so taking the same limit gives a system of coupled differential equations:

$$\frac{dW}{dt} = W \odot \left(\frac{X H^\top}{W H H^\top} - \mathbf{1}\right) \qquad \frac{dH}{dt} = H \odot \left(\frac{W^\top X}{W^\top W H} - \mathbf{1}\right)$$

Here I'm trying to be explicit about the fact that we are doing element-wise multiplication, division, and subtraction. $\mathbf{1}$ represents a matrix of ones of the appropriate size for each equation.

Head-to-Head Comparison

To compare the discrete and continuous versions of these NMF algorithms, I want to focus on overall accuracy as opposed to speed (which will boil down to the number of matrix multiplications performed). I will perform the factorization on the following matrix which defines a circle of random numbers.

using Plots

n,m = (400,500)
k = 40
W0 = rand(n,k)   # random non-negative starting factors
H0 = rand(k,m)

# Disk of radius 50 centered at (200,200) filled with larger random values,
# surrounded by smaller random noise
X = [(x-200)^2+(y-200)^2>50^2 ? rand() : 10*rand() for x in 1:n, y in 1:m]

heatmap(X,clims=(0,10))

The ODE method is going to automatically choose the step size based on the solver choice. To make a fair comparison, I will run the multiplicative update for the same number of iterations.

ODE Implementation

using LinearAlgebra
using OrdinaryDiffEq
using RecursiveArrayTools

# Define ODEs
function nmf!(du, u, p, t)

    X, WH, XHᵀ, WHHᵀ, WᵀX, WᵀWH = p   # data matrix and preallocated buffers
    W = u.W
    H = u.H

    # In-place matrix products to avoid allocations
    mul!(WH,   W,  H)
    mul!(XHᵀ,  X,  H')
    mul!(WHHᵀ, WH, H')
    mul!(WᵀX,  W', X)
    mul!(WᵀWH, W', WH)

    # Element-wise right-hand sides of the gradient flow
    @. du.W = W * (XHᵀ / WHHᵀ - 1)
    @. du.H = H * (WᵀX / WᵀWH - 1)
end

# Preallocate work buffers passed to nmf! through the parameters
WH = similar(X)
XHᵀ = similar(W0)
WHHᵀ = similar(W0)
WᵀX = similar(H0)
WᵀWH = similar(H0)

p = (X, WH, XHᵀ, WHHᵀ, WᵀX, WᵀWH)
u0 = NamedArrayPartition((;W=W0,H=H0))   # lets us write u.W and u.H inside nmf!
tspan = (0.0, 1000.0)

#Solving
prob = ODEProblem(nmf!, u0, tspan, p)
sol = solve(prob,ROCK4())

I tried a few different ODE solvers and found ROCK4 to be very efficient.
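If you want to repeat that comparison yourself, one rough way to do it is below (Tsit5 is just an arbitrary alternative explicit solver; step counts and timings will of course vary by machine):

# Rough solver comparison: elapsed time and number of saved steps
for alg in (ROCK4(), Tsit5())
    t = @elapsed s = solve(prob, alg)
    println(nameof(typeof(alg)), ": ", length(s), " steps, ", round(t, digits = 2), " s")
end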

Multiplicative Update Implementation

mutable struct NMFMU{M}
    X::M
    W::M
    H::M
end

nmf = NMFMU(copy(X),copy(W0),copy(H0))

# One multiplicative-update sweep, updating W in place and then H
function MU!(nmf::NMFMU)

    nmf.W .= nmf.W .* (nmf.X * nmf.H') ./ (nmf.W * nmf.H * nmf.H')
    nmf.H .= nmf.H .* (nmf.W' * nmf.X) ./ (nmf.W' * nmf.W * nmf.H)

    return nothing
end

This is a very inefficient implementation; for an optimized version, please use the NMF.jl package.
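For reference, a rough sketch of what that might look like with NMF.jl is below (the :multmse algorithm choice and keyword names are from memory; check the package documentation before relying on them):

using NMF

# Sketch: multiplicative updates on the squared-error loss via NMF.jl
result = nnmf(X, k; alg = :multmse, maxiter = 500)
norm(X - result.W * result.H) / norm(X)   # relative error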

Loss over Iterations

# relative error
loss(X,W,H) = norm(X-W*H) / norm(X)
loss(nmf) = loss(nmf.X, nmf.W, nmf.H)

#ODE
ode = [loss(X,sol[i].W,sol[i].H) for i in eachindex(sol)]
pushfirst!(ode,loss(X,W0,H0))

  
#MU
mu = [loss(X,W0,H0)]

for _ in eachindex(sol)
    MU!(nmf)
    push!(mu,loss(nmf))
end

plot(ode, label="ODE", linewidth=3, xlabel="Iterations", ylabel="Error")
plot!(mu, label="MU", linewidth=3)

The first thing I notice is the first step of each algorithm. MU takes a huge step in the right direction, whereas the ODE method is very conservative in its descent. More surprising is that the final solutions are different. At least for this kind of random matrix, the ODE method is able to find a better solution. Maybe MU would eventually end up at the same solution? I've run these methods for more iterations, but this does not seem to be the case.
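One quick way to check is to keep running the multiplicative updates well past the original step count and see whether the error keeps falling (a sketch continuing from the nmf state above):

# Run MU for many more iterations and compare against the ODE's final error
for _ in 1:10_000
    MU!(nmf)
end
loss(nmf), ode[end]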

There are a lot of improvements and variations we could try with this system, like accelerated gradient descent, translating more efficient NMF algorithms, and investigating steady-state solutions. Hopefully this will lead to better algorithms for NMF!