## Linear Addition from the Log Domain

Some speech recognizers and machine learning algorithms need to quickly calculate the quantity:

$\log(x+y)$

when given only $\log(x)$ and $\log(y)$. The straightforward algorithm first uses the exponential function to convert the log values to the linear domain representation $x$ and $y$, then performs the sum, and finally uses the log function to convert back to the log domain:

$\log(x+y) = \log(\exp(\log(x)) + \exp(\log(y)))$

These conversions between log and linear domains are slow and problems arise when $x$ is too large or small for a machine’s floating point representation.

Luckily, there is a clever approximation method that allows quick, accurate calculation with a relatively small precomputed table.

## Derivation

The table-based approximation can be derived by deciding to first look for a function $f(x,y)$ such that:

$\log(x+y) = \log(x) + f(x,y)$

Let’s consider the case where $x \geq y$, without loss of generality. It looks like this:

On the left, $\log(x+y)$ and $\log(x)$ are plotted in 3D, for $x \geq y$. The slice where $y=1$ is shown in the 2D plot on the right. The difference between the two curves, shaded on the right, is the contribution from $f(x,y)$.

Via algebra and some logarithm identities, solve for $f(x,y)$ in terms of $\log(x)$ and $\log(y)$:

$\log(x+y) = \log(x) + \log(1+\exp(\log(y)-\log(x)))$

The function $f(x,y)$ can be pre-computed and stored in a table $T[i]$. A table is particularly apt because it is:

• Easily indexed as a function of $\log(x)-\log(y)$, since this is a non-negative number
• Reusable across the entire range of $\log(x)$, since it depends only on the relative difference between $\log(x)$ and $\log(y)$
• Compact, since the contribution is small when $x \gg y$

The table $T[i]$ is pre-calculated as:

$T[i] = \log(1 + \exp(-\frac{i}{\omega})) \qquad 0 \leq i \leq \Omega$

and used at run-time as:

$\log(x+y) = \log(x) + T[\lfloor\omega\cdot(\log(x)-\log(y))+\phi\rfloor]$

Where a simple and quick-to-compute discretization function has been defined to round $\log(x)-\log(y)$ to table indexes. Let’s call the parameters sampling frequency ($\omega$), rounding threshold ($\phi$), and table size ($\Omega$). These affect the approximation accuracy and are described in the next section. For now, the figure below shows the original function and the table approximation when $\omega=1$ and $\phi=0$:

On the left, $\log(x+y)$ and $\log(x)+T[\cdot]$ are plotted in 3D, for $x \geq y$. The slice where $y=1$ is shown in the 2D plot on the right. Note the discontinuities introduced by the floor function in the approximation.

## Approximation Error

Approximation error is affected by the table length and the discretization function. These are parameterized by the quantities:

• $\Omega$ – table length
• $\phi$ – rounding threshold
• $\omega$ – “sampling frequency”

Since this blog post is stunningly long already, $\omega$ won’t be analyzed in detail. Predictably, increasing the sampling frequency decreases the approximation error at the cost of storing more samples in the table. Since the largest error occurs at the start of the table, increasing $\omega$ is the best way to lower the maximum error without changing the discretization function.

### Table length and $\Omega$

Longer tables are theoretically better than shorter tables. However, long tables can hurt cache performance and, after a certain point, the table contribution is so small that ignoring it barely affects error. Limiting the size of the table is a smart speed/precision trade-off.

It is possible to provide an upper-bound on the largest useful table when the answer to $\log(x+y)$ is stored in a limited precision representation. Consider that the smallest value that can be represented in single-precision floating point is $2^{-149}$. Be generous and allow all intermediate calculations to use arbitrary precision, and allow rounding the result before storing it in single-precision as the final answer. If the table element used contributes less than half of the smallest single-precision-representable value, this table element can never affect the stored result:

$\frac{1}{2} \cdot 2^{-149} > \log(1+\exp(-\frac{\Omega+1}{\omega}))$

And solving for $\Omega$ gives:

$\Omega > -\omega \log(\exp(2^{-150})-1)-1$

Which is the surprisingly small $\Omega=103$ when $\omega=1$.

### Rounding to an index and $\phi$

The table index is proportional to $\log(x)-\log(y)$. The parameter $\phi$ determines the discretization of this quantity, so that it can be mapped to an integer and used as a table index. When $\phi=0$, the discretization operation is truncation. When $\phi=0.5$, the discretization operation is standard rounding.

If fast evaluation is the one and only goal, use truncation by setting $\phi=0$. Otherwise, consider the error due to the table approximation:

$|\log(1+\exp(-\frac{\lfloor \omega d + \phi \rfloor}{\omega})) - \log(1+\exp(-d))|\qquad 0\leq d \leq \Omega$

Which, for $\omega=1$, looks like:

Two views of the table approximation error as $d$ and $\phi$ vary. Error is greatest at small indexes. Error is predictably zero at the table indexes because these are the recorded samples from the underlying function. Note that, besides looking neat, that we can see some values of $\phi$ are obviously better than others.

An looking at the error at the start of the table at particular values of $\phi$ gives:

Table approximation error for $\phi=0$ and $\phi=0.5$. Note the approximation error is zero at the table indexes and that truncation introduces worse error than rounding.

One choice of $\phi$ would be the value that minimizes the total table error:

$\text{argmin}_{\phi} \int_0^{\Omega} |\log(1+\exp(-\frac{\lfloor \omega d + \phi \rfloor}{\omega})) - \log(1+\exp(-d))|\ \mathrm{d}d$

This integral is difficult to evaluate and minimize, in part due to the floor function and the discontinuities it introduces. Frequency and table length both affect the optimal value. The figures below illustrate the behavior of this equation. In an appendix at the end of this post, near-optimal values of $\phi$ are provided for different settings of $\omega$.

Discretization error for different phi and omega. The piecewise horizontal function is defined by the table approximation; its values are the sampled values from the continuous function. At low frequencies, the continuous function is curvy enough that error is minimized when $\phi$ is set a tad above the value that would give standard rounding. Here, $\phi=0.57$ is very close to the optimal value. However, as the sampling frequency increases, the continuous function is well-approximated locally by straight lines. So the optimal value of $\phi$ approaches $\phi=0.5$, which is standard rounding.

The estimated optimal value of $\phi$ as $\omega$ varies:

Estimated optimal $\phi$ as $\omega$ varies. See the table in the appendix for the source data. Note that as $\omega$ becomes large, traditional rounding gives lower errors.

## Log Base trick

Changing the logarithm base has the same effect on error as changing the sampling frequency $\omega$. If the value of the logarithm base is not important in this part of the application, save a multiplication by setting $\omega=1$ and choosing the log base that achieves the same approximation error.

An attempt at an intuitive, practical explanation is that increasing $\omega$ maps the set of values $\lfloor \omega \cdot \log(x)-\log(y) \rfloor$ to a larger set of integers. Changing the log base has the same effect, since:

$\lfloor \log_b(x) - \log_b(y) \rfloor = \lfloor 1/\log(b) \cdot (\log(x) - \log(y)) \rfloor$

That $1/\log(b)$ is equivalent to $\omega$. Solving for the log base gives $b \approx \{1.105,1.010,1.001\}$ for $\omega=\{10,100,1000\}$.

## Conclusion: Inspiration and Resources

I orginally found this table-based approximation method in the Sphinx 4 source code. It is useful in several places, one of which is in calculating the likelihood of a frame of audio data given a Gaussian mixture model (GMM). This calculation lies in what is effectively the inner-most loop of the speech recognizer. It runs tens of thousands of times each second and speeding it up is a worthwhile optimization.

The source code of LogMath.java and its javadoc documents this beautiful little algorithm well. It would have never caught my attention without that careful documentation. From what I remember, the documentation describes a derivation, some types of log-base tricks, and a similar idea behind how the table length can be bounded. This post builds on those explanations, with a slightly better-framed derivation, an analysis of $\phi$ that is completely new, and pretty pictures.

This post is also much, much longer. I kept finding new and interesting tidbits as this became my hobby for the month. I couldn’t bring myself to leave some of the gems unindexed. And there are still interesting questions left to be answered. Some of these depend on the details of different applications but there are also general questions like: is it ever worth making a 2 dimensional table to approximate the calculation $\log(x+y+z)$?

## Appendix: Table of Good Values for $\phi$

Good values of $\phi$ as $\omega$ varies, on the basis of minimizing error over the first 100 entries in the table:

$\omega$ $\phi$ lower-bound on error
1 0.588644 0.169006
2 0.54489 0.0861034
3 0.529998 0.057602
4 0.522519 0.043254
5 0.518022 0.0346227
6 0.515018 0.0288611
7 0.512875 0.0247426
8 0.511267 0.0216523
9 0.510017 0.0192478
10 0.509073* 0.0173243*
15 0.506008
20 0.504494
25 0.50357
30 0.502952
35 0.5025
40 0.502159
45 0.501895
50 0.501684
100 0.5009* 0.00173275*
200 0.500353
300 0.500245
400 0.50018
500 0.500143
600 0.500115
700 0.500081
800 0.500091
900 0.500019
1000 0.500062* 0.0000950386*

The numbers with asterisks were calculated over the first 1000 table entries. This gives a better estimate of the best value of $\phi$ for a table of this length or longer. With more entries, the optimal value of $\phi$ inches up, but not greatly. However, the error lower-bound is significantly affected when more samples are used, so only a few of the sufficiently-sampled lower-bounds were shown.