The Richardson-Lucy Algorithm

Deconvolution by the Richardson-Lucy algorithm is achieved by minimizing the convex loss function derived in the last article

(1)   \begin{equation*} J(O) = \sum \bigg (O**PSF - I\cdot ln(O**PSF) \bigg) \end{equation*}

with

  • J, the scalar quantity to minimize, function of ideal image O(x,y)
  • I(x,y), linear captured image intensity laid out in M rows and N columns, corrupted by Poisson noise and blurred by the PSF
  • PSF(x,y), the known two-dimensional Point Spread Function that should be deconvolved out of I
  • O(x,y), the output image resulting from deconvolution, ideally without shot noise and blurring introduced by the PSF
  • **   two-dimensional convolution
  • \cdot   element-wise product
  • ln, element-wise natural logarithm

In what follows indices x and y, from zero to M-1 and N-1 respectively, are dropped for readability.  Articles about algorithms are by definition dry so continue at your own peril.

So, given captured raw image I blurred by known function PSF, how do we find the minimum value of J yielding the deconvolved image O that we are after?

One way would be to try all possible combinations of pixel intensities in image O calculating the loss function J each time with Equation (1), then choose the image associated with the lowest J.  However, image O these days would typically be made up of several tens of millions of 16-bit pixel values – so until we get our hands on a quantum computer, trying them all would take (almost) forever.

Minimum when Gradient = Zero

However, we are told that J(O) is a convex function and all convex functions have a minimum (recall from the last article that we are minimizing the unlikelihood).  We also remember from high school that in order to find the minimum of a convex curve we take its derivative and set it equal to zero.  It works the same with 2D and higher order surfaces, we simply compute their gradient (\nabla) by partial derivative and set that equal to zero.  As explained in Dey et al[3]  the gradient of function J(O) is

(2)   \begin{equation*} \nabla J(O) = \bigg[ 1 -\cfrac{I}{O**PSF}\bigg] ** PSF^T. \end{equation*}

The fraction above represents element-wise division of the resulting image intensities, ** two dimensional convolution and  ^T the adjoint operator, which in our real positive case is equivalent to flipping the function horizontally and vertically: PSF^T = PSF(-x,-y).

But wait a second, this is starting to look very much like a multivariate linear regression problem: when we have a convex function J(O) and its gradient \nabla J(O), aren’t there tried and true algorithms pretty well guaranteed to converge to a solution in more or less efficient fashion?

RL and Gradient Descent: the Additive Solution

Yes there are, in fact there is one algorithm much beloved in Machine Learning called Batch Gradient Descent: it basically keeps going down the slope of the surface, like a bunch of ping pong balls until they stop at the bottom of the hill.  It looks like this:

(3)   \begin{equation*} O_{k+1} = O_k - \alpha \cdot \nabla J(O_k) \end{equation*}

with the new iteration of image parameters (O_{k+1}) equal to the last iteration (O_k) minus a small proportion (\alpha) of the relative gradient (\nabla J).

Think of each pixel as having a gradient towards its ideal state, then taking a step down the slope in that direction.  At every iteration the algorithm takes stock of its new situation and recalculates the gradient for all pixels simultaneously, in a batch process.

How big of a step is taken each time is controlled by the factor \alpha, known as the learning rate in ML: the smaller it is, the smaller the relative steps it takes on the way down to the bottom.  Smaller steps mean a more controlled descent but also a slower speed getting to the solution.  Typical values for \alpha for our purposes tend to be in the 1-10% range.

So in the end the iterative Gradient Descent solution is fairly easy to implement, given capture image I and a known PSF to deconvolve: seed O_k initially to be equal to I, determine the gradient \nabla J(O_k) with Equation (2) , update O per Equation (3) – and repeat until convergence.

Because of the nature of the cost function J(O), each residual should be smaller than the last and convergence is achieved when O_{k+1} stops changing appreciably compared to the previous iteration, indicating that J has leveled off in its minimum, yielding as a byproduct the desired output image deconvolved in the presence of Poisson noise.  Ta-dah, Richardson-Lucy meets Machine Learning.

LR Multiplicative Solution

An alternative iterative approach, and the one most seen in software currently in use, is the multiplicative solution to the Richardson-Lucy deconvolution problem.

It works like this.  First assume that the PSF is normalized so that its integral is equal to one.  This is necessary anyways in order not to change mean image intensity during the convolutions, in other words 1**PSF=1.   We then set the gradient \nabla J(O) from Equation (2) equal to zero to find the location of the minimum of the joint loss function.  Taking advantage of 2D convolution’s commutative, associative and distributive properties  we obtain

(4)   \begin{equation*} 1 = \cfrac{I}{O**PSF} ** PSF^T \end{equation*}

1 here represents an image-sized array of ones.  The fact that the loss function J(O) is convex, it decreases after each iteration, the residuals always get smaller[2] and, when converged, further iterations should ideally stop changing the resulting image – means that at convergence O_{k+1} should effectively be equal to O_k , or \frac{O_{k+1}}{O_k} = 1.  Substituting this last bit into Equation (4) yields the Multiplicative RL solution

(5)   \begin{equation*} O_{k+1} = O_k  \cdot \bigg [\cfrac{I}{O_k**PSF} ** PSF^T \bigg ] \end{equation*}

where the product and division above are element-wise, ** represents 2D convolution and  ^T the adjoint operator as before.

Seed starting parameters in O_k with random values (or a mid-gray image, or I ) and proceed like in the additive case, stopping when O_{k+1} essentially no longer changes or J levels off.

Number of Iterations vs Noise

The output image O_{k+1} improves after each iteration as J approaches its minimum.  The stage at which visible improvements stop occurring is typically achieved quickly, after only a few iterations.  Ideally we would stop the algorithm right there and then because from that point on the sharpness of the image itself will not improve appreciably but noise will get worse as a result of all the additional processing.

Appreciably‘ is perceptual and not well defined.  In fact where to stop is a subjective, image dependent compromise: in some cases it would be better to stop the algorithm early in order to keep noise in check after the restoration gave quick perceptual improvements, in others to let it go longer.  Automatic stopping criteria are typically not particularly good at making this choice, the record is littered with papers on the subject.  This is the reason why many software packages allow the user to control the number of iterations manually and/or provide the ability to mask out areas of the image that would not benefit from deconvolution.

The compromise could be better managed through regularization priors built into the cost function, think of them as targeted shock absorbers, a topic we will leave for another time.[3]

Implementation Notes

Variations on the multiplicative algorithm are how Richardson-Lucy deconvolution is implemented in many commercial packages today because it tends to converge faster.

In many cases the PSF varies considerably throughout the field of view and it is not known precisely.  So in photographic applications it is usually guessed to be a generic Gaussian shape with standard deviation an input parameter chosen by the user, typically labeled ‘radius’ in units of pixels.  See for example PhotoShop Smart Sharpen ‘Gaussian Blur’ with ‘More Accurate’ ticked or, better, open source RawTherapee.[4]

What they are really correcting there is a milkshake of effects and aberrations including blur introduced by diffraction, defocus, axial CA, AA (if present), demosaicing and more – all of which also vary more or less with wavelength, hence color channel.  The silver lining in all this guessing and imprecision is that the more numerous and diverse the blurring functions one throws into the milkshake, the more the aggregate PSF should start to look Gaussian thanks to the central limit theorem.  This is also partly the reason why deconvolution is normally applied to a, preferably linear, ‘luminance’ (Y) component as opposed to the individual R, G, B color planes.

Of course where the PSF‘s radius or shape is not accurate, deconvolution will produce unwelcome artifacts like halos and ringing.   When used in moderation, however, it is amazing how well it works in practice with typical landscape captures.

 

Notes and References

1. Bayesian-Based Iterative Method of Image Restoration, William Hadley Richardson, Optical Society of America (1972).
2. An iterative technique for the rectification of observed distributions, Lucy, L. B., Astronomical Journal, Vol. 79, p. 745 (1974)
3. 3D Microscopy Deconvolution using RL and Total Variation Regularization, Dey et al. (2004). Appendix C has a good review of the theory underlying this and similar algorithms.
4.  As  of this writing open source converter RawTherapee provides two implementations of RL deconvolution in its toolkit: as a method in the Detail/Sharpening tab – and as part of the Raw tab under the Capture Sharpening header. The latter is better and applied earlier in the pipeline on linear data. Ingo Weyrich, its developer, says that he uses gaussian kernels of various sizes depending on the selected radius (standard deviation): 3×3, 5×5, 7×7, 9×9 and 13×13 for radii less than 0.6, 0.84, 1.15, 1.5 and 2 pixels respectively.

12 thoughts on “The Richardson-Lucy Algorithm”

  1. I came across your site almost by accident and am hugely impressed by the clarity of your posts … you should write a textbook on this. Of course, perhaps you have! Anyway, keep up the good work. It is a pleasure to read such clear descriptions.

    1. Why, thank you Chris, my pleasure. And no book, the breadth and depth of the subject matter is overwhelming. Who knew, when I picked up my first DSLR in 2007.

  2. Jack,
    Love the articles, I am learning a lot. I am a research geoscientist, and work a lot with inverse problems that are vaguely like this, though I must confess I know little about optics. My problems are vaguely like this in that they have noisy observations, are spatially correlated, and involve inversion which is really just another word for deconvolution. IMO. With that caveat, I’m suspicious of max likelihood methods, and have come to rely more on sampling methods like Markov chain Monte Carlo. I suspect you’re aware of these methods, they are less “greedy” and prone to choosing false optima. Have you played with these methods? (I don’t use MATLAB, but can sorta read it; I use R, and Python, which have similar features but frustratingly different syntax…but are free, which matters to poorly funded academics like me).
    Thanks again for fabulous stuff, Chris

    1. Hi Chris, thanks for the encouraging words.

      I like ‘inversion’. The first time I came across the term deconvolution was with seismic/acoustic survey data in a previous life. We performed it in the field inside a truck featuring dual PDP-11s 🙂

      I have never explored Markov chain and Monte Carlo for inversion type problems. I would be interested in learning more if you can suggest non-paywalled sources relevant to image processing.

      Jack

  3. Hi Jack,

    Thank you for your excellent posts and I really enjoy reading it!
    I’ve forgotten most of stuff taught in university mathematics, but the way you explain things is clear and concise.

    I have a naive question: It seems that iterative methods are preferred in many software, is it prohibitively hard to solve equation (2) directly when it is set to equal to 0?

    thanks again for your time,
    JJ

    1. Hello JJ,

      If we do that we need to solve Equation (4), so we have to undo a couple of convolutions. Probably the easiest way to do that is in the frequency domain – but then we run into the noise issues outlined in the naïve deconvolution article. And getting around those is how RL came to be.

      Jack

  4. Hello J,
    Excuse me ,Could it be explained more clearly? I don’t understand the phrase: ‘The fraction above represents element-wise division of the resulting image intensities.’

    1. Hi Chen, I mean that the 2D image linear intensities as captured are divided element-by-element by the guesstimated 2D ideal image intensities convolved with the Point Spread Function . The resulting division is a 2D array of the same size as the numerator and the denominator.

  5. Hi Jack,

    Just wanted to say a big thank You – I come from the field of Mass Spectrometry (proteins, fats, etc.) and was perplexed by the algorithms used to analyse our data. Thank You so much for such clarity, I’ve not found a better article explaining it.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.