# On the Awesomeness of Normalized Cross-Correlation

Ode to a pattern matching algorithm for images

My beloved algorithm, I devote this article to you personally. You saved my honor in an important team project. I owe you many points acquired in a competition, thanks to you. You are the cornerstone of a job assignment of mine. It is you I remember when I think of pattern recognition. You in your simpleness and numerical stability are the strong link between sensed reality and information theory. You supply top-class service yet you get very little recognition from the general public. Hereby, I would like that to change.

## Applications

Normalized cross-correlation is a rather simple formula that describes the similarity of two signals. As such, it serves well for searching a known pattern in an image. You can use it when looking for a specific face in a photograph or for a letter in a scanned document. We can use it to count a lot of repetitive stuff in an aerial photo, such as solar pannels. However in its pure generality, the formula can be used to countless tasks completely unrelated to computer vision.

A downside is that cross-correlation can hardly account for rotation or other deformations. If you need to find a way around that, it will usually not be by far as pretty. Good aspects follow.

Each possible location of the searched pattern is evaluated, and we can even get results more precise than the pixel level. Cross-correlation might well be the simplest thing to calculate for the purpose and the normalization we use is the most profound normalization all around. That formula to evaluate is one single step and outputs clear values between −1 and 1. No parameters involved. Apart from simple arithmetics, the only computation performed is Fast Fourier Transform. You may read that as "blazing fast".

## Intuition

To make things simple, imagine that the images consist of pure white and pure black only. We may be searching for a white letter A in a large document. For each pixel of the document – that is, for each supposed top-left corner of our letter A – we want to assess if the letter A is placed there or not. An interesting value to calculate is the number of matching pixels minus the number of non-matching ones. The maximum value possible is the total count of pixels in the letter and that means a perfect match. The minimum occurs if the image is exactly inverted and the value is again the same, just with a minus sign. Everything else will produce something in between, in correspondence to the similarity of the two images. If one of the images is just noise, the result will be around zero.

The reason to choose specifically the procedure mentioned above is its beauty from an arithmetical point of view. If we denote white pixels by 1 and black ones by −1, the result is just a per-pixel product of letter A with the image, summed up. (Check for yourself: −1 · −1 = 1.) A wonderful thing about this approach is that it can be used to classical images without any modification. Black is −1, white is 1 and shades of gray are the values in between. Multiplication is multiplication. There comes a problem, however.

Suppose that the letter A we search for consists mostly of gray (that is, values around zero). If we multiply it by a factor of two, positive values will get brighter and negative ones darker, resulting in increased contrast overall. This high-contrast letter, if placed somewhere within the image, will produce twice the value of a perfect match (yes, the math is that simple so far). Obviously, that is something we do not wish to happen. The perfect match must always be the winner.

A secure way to go is to make sure that modifications of contrast have no effect at all on the resulting values. That means the original letter will produce exactly the same value as the high-contrast one, which seems quite fair. In a similar manner, we can account for changes in brightness, as if somebody added a constant value to each of the pixels. As you can imagine, brightening an image from the black-gray scale (from −1 to 0) into the gray-white scale (from 0 to 1) would cause a paradox much worse than the one we just solved. We need to, and shall, get all these things out of our way.

## Formulation

The formula of classical cross-correlation of image $I$ with mask $M$ at position $x, y$ is a two-dimensional sum: ${\sum }_{i, j}I\left(x+i, y+j\right) · M\left(i, j\right)$. Exactly as proposed, it is just the sum of per-pixel products of the image and the mask. The indices $i, j$ are summed across a range such as not to exceed the dimensions of the mask.

The first step of normalization will be cancelling out modifications of brightness. Brute-force approach is okay since we are still in the area of theory. We calculate the average value of both the mask and the corresponding area in the image, and we subtract this value from each of the respective pixels. Average value is something like this: $\frac{{\sum }_{i, j}M\left(i, j\right)}{area\left(M\right)}$, so the modified version of our formula reads as follows: . Note that the average of only the overlayed portion of the image is used.

Before things get to be too nasty, let us get rid of the indices. Just as a matter of aesthetics, the formula shall be written in the following way: $\sum \left(Iʹ - a · \sum Iʹ\right) · \left(M - a · \sum M\right)$. The constant is just a shorthand. Note that instead of the whole image $I$, we only consider the small cut-out $Iʹ$ around the mask, so $area\left(Iʹ\right) = area\left(M\right)$.

Now we need to calculate and cancel out the contrast. A suitable formula is $contrast\left(M\right) =\sqrt{\sum M · M}$. You can verify that contrast of $2 · M$ evaluates to $2 · contrast\left(M\right)$, and that was the point. Not forgetting we have to apply brightness normalization first and then contrast normalization, we arrive at the formula $\sum \frac{\left(Iʹ - a · \sum Iʹ\right)}{contrast\left(Iʹ - a · \sum Iʹ\right)}·\frac{\left(M - a · \sum M\right)}{contrast\left(M - a · \sum M\right)}$. Finally, we substitute for the function $contrast$, obtaining:
$normxcorr\left(Iʹ, M\right) = \sum \frac{\left(Iʹ - a · \sum Iʹ\right)}{\sqrt{\sum \left(Iʹ - a · \sum Iʹ\right)· \left(Iʹ - a · \sum Iʹ\right)}}·\frac{\left(M - a · \sum M\right)}{\sqrt{\sum \left(M - a · \sum M\right) · \left(M - a · \sum M\right)}}$.
You can verify that $normxcorr\left(M, M\right)$ will always equate to $1$.

One step further would be to put back in the indices of $i, j$ and make it all into a function of coordinates $x, y$ but I will spare you of that.

## Further tweaking

We can generalize this thing to RGB images using dot product and to higher polynomial functions than simple multiplication. Until I finish this article, you have to find out the particularities for yourself.

## Efficient calculation

The trick is that each of the sums in the formula can be calculated using the convolution theorem and the Fast Fourier Transform algorithm. Until I write this down, you can look into the wikipedia article to find out some interesting facts about the Fourier transform (not that it would help you programming, though).

Another trick helps if the reference image is huge and the pattern is tiny, since FFT could easily be outperformed by the stupidmost for-loop calculation of the formula. The thing to do is a tile-based hybrid approach – in each tile, you do the convolution trick with FFT and then you crop out meaningful parts of all the tiles and stitch them together. The area to cut out is as wide as the pattern so the optimal tile size also depends on the pattern size. The overall complexity is linear, which is rather cool.

There is yet another hint, much simpler: precalculate all you can.

## Uses outside pattern recognition

If we forget about Fourier transform for this moment, the formula of normalized cross-correlation need not be limited to pixel-by-pixel search. We can try to search for the correct pattern not only by shifting it somewhere but also by many other transformation (that is, if we do not care much about execution speed).

One of such wonderful applications that I had the honor to use was polygon simplification. The input of this task is a grayscale image that we want to approximate by a polygon with few vertices, if possible. I do not want to delve into much details, but we can basically perform a minimization task on the positions of the vertices. To get a result as simple as possible, we can make a heavy penalty on the count of resulting vertices. Problem is, what function to optimize? Normalized cross-correlation of the polygon versus the original grayscale bitmap turns out to be a great choice. Rasterization of a 2D polygon is rather an easy task that can be further optimized to the bones so that it only outputs the cross-correlation score.

## Example implementation

Far from perfect, but so far you can copy from this piece of code. It uses OpenCV 2 and its two parameters should be filenames of two images, the second of which is the pattern to be searched. The program displays the first image with all occurences of the pattern marked by black-white dots.

#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>

typedef cv::Mat FloatMat;
#define DEBUG
#ifdef DEBUG
FloatMat visualization;
FloatMat glob_image2;
FloatMat glob_glyph;
#endif

inline float pow2(float x)
{
return x * x;
}

struct Spectrum
{
Spectrum(const FloatMat &image):
Spectrum(image.clone(), image.size())
{
}
Spectrum(FloatMat &&image, const cv::Size &size)
{
time = image;
FloatMat tmp(size, CV_32FC1, cv::Scalar(0));
image.copyTo(tmp(cv::Rect(0, 0, image.cols, image.rows)));
cv::dft(tmp, freq, cv::DFT_COMPLEX_OUTPUT, image.rows);
}
FloatMat time, freq;
};

FloatMat correlate(const Spectrum &image, const Spectrum &glyph)
{
FloatMat correlation;
cv::mulSpectrums(image.freq, glyph.freq, correlation, 0, true);
cv::dft(correlation, correlation, cv::DFT_SCALE | cv::DFT_INVERSE | cv::DFT_REAL_OUTPUT, 0);
FloatMat blockSum;
// TODO optimize the boxFilter out of this function
cv::boxFilter(image.time, blockSum, -1, glyph.time.size(), cv::Point(0, 0), false, cv::BORDER_REPLICATE);
FloatMat blockNorm = image.time.mul( image.time );
cv::boxFilter(blockNorm, blockNorm, -1, glyph.time.size(), cv::Point(0, 0), false, cv::BORDER_REPLICATE);
int area = glyph.time.cols * glyph.time.rows;
FloatMat result = blockNorm - blockSum.mul(blockSum) / area;
cv::sqrt(result, result);
cv::divide(correlation, result, result);
return result;
}

{
FloatMat supp;
assert (image.type() == CV_32F);
for (int i=0; i<image.rows; i++) {
float
*irow = image.ptr<float>(i),
*mrow = supp.ptr<float>(i);
for (int j=0; j<image.cols; j++) {
if (mrow[j] > irow[j])
irow[j] = 0;
}
}
}

FloatMat convert(cv::Mat colorInput)
{
assert (colorInput.cols > 0 and colorInput.rows > 0);
FloatMat grayOutput;
cv::cvtColor(colorInput, grayOutput, CV_BGR2GRAY);
grayOutput.convertTo(grayOutput, CV_32F, 1./255);
return grayOutput;
}

void clamp(FloatMat &image, float low, float high)
{
for (int y = 0; y < image.rows; y++) {
float *row = image.ptr<float>(y);
for (int x = 0; x < image.cols; x++) {
float val = row[x];
row[x] = (val > high) ? high : ((val > low) ? val : low);
}
}
}

int main(int argc, char ** argv)
{
if (argc < 3) {
return 1;
}
cv::Scalar shift = cv::sum(glyph);
shift *= -1.f / (glyph.rows * glyph.cols);
glyph += shift;
glyph /= cv::norm(glyph);
Spectrum image(gray);
nonMaximumSuppression(correlation, 3);
int count = 0;
for (int y = 0; y < correlation.rows; y++) {
float *densityRow = correlation.ptr<float>(y);
for (int x = 0; x < correlation.cols; x++) {
if (densityRow[x] > 0.8) {
count += 1;
cv::circle(gray, cv::Point (x,y), 1, cv::Scalar(1.0), -1, 8);
gray.at<float>(y, x) = 0.0;
}
}
}
printf("%i blobs.\n", count);
cv::imshow("blobs", gray);
while (char(cv::waitKey()) != 27);
return 0;
}