Linear Addition from the Log Domain

Some speech recognizers and machine learning algorithms need to quickly calculate the quantity:

\log(x+y)

when given only \log(x) and \log(y). The straightforward algorithm first uses the exponential function to convert the log values to the linear domain representation x and y, then performs the sum, and finally uses the log function to convert back to the log domain:

\log(x+y) = \log(\exp(\log(x)) + \exp(\log(y)))

These conversions between the log and linear domains are slow, and problems arise when x or y is too large or too small for a machine's floating-point representation.
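
To make the failure mode concrete, here is a minimal sketch of the straightforward algorithm in Python (the function name is illustrative; double precision is assumed):

    import math

    def log_add_naive(log_x, log_y):
        # Convert to the linear domain, add, then convert back.
        return math.log(math.exp(log_x) + math.exp(log_y))

    print(log_add_naive(math.log(2.0), math.log(3.0)))  # 1.609... = log(5)

    # The round trip fails once x leaves the representable range:
    try:
        log_add_naive(1000.0, 1000.0)   # exp(1000) overflows a double
    except OverflowError as err:
        print("overflow:", err)         # math range error

The overflow is exactly the problem the log domain was meant to avoid: the logs themselves are perfectly ordinary numbers, but exponentiating them is not safe.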

Luckily, there is a clever approximation method that allows quick, accurate calculation with a relatively small precomputed table.

Derivation

The table-based approximation can be derived by first looking for a function f(x,y) such that:

\log(x+y) = \log(x) + f(x,y)

Let’s consider the case where x \geq y, without loss of generality. It looks like this:

log(x+y) and log(x) shown in 3d and in 2d along the slice y=1

On the left, \log(x+y) and \log(x) are plotted in 3D, for x \geq y. The slice where y=1 is shown in the 2D plot on the right. The difference between the two curves, shaded on the right, is the contribution from f(x,y).

Via algebra and some logarithm identities, factor out x and solve for f(x,y) in terms of \log(x) and \log(y):

\log(x+y) = \log(x \cdot (1+\tfrac{y}{x})) = \log(x) + \log(1+\tfrac{y}{x}) = \log(x) + \log(1+\exp(\log(y)-\log(x)))

So f(x,y) = \log(1+\exp(\log(y)-\log(x))), which depends only on the difference \log(x)-\log(y).

The function f(x,y) can be pre-computed and stored in a table T[i]. A table is particularly apt because it is:

  • Easily indexed as a function of \log(x)-\log(y), since this is a non-negative number
  • Reusable across the entire range of \log(x), since it depends only on the relative difference between \log(x) and \log(y)
  • Compact, since the contribution is small when x \gg y

The table T[i] is pre-calculated as:

T[i] = \log(1 + \exp(-\frac{i}{\omega})) \qquad 0 \leq i \leq \Omega

and used at run-time as:

\log(x+y) = \log(x) + T[\lfloor\omega\cdot(\log(x)-\log(y))+\phi\rfloor]

Here, a simple and quick-to-compute discretization function rounds \log(x)-\log(y) to a table index. Let's call the parameters the sampling frequency (\omega), the rounding threshold (\phi), and the table size (\Omega). These affect the approximation accuracy and are described in the next section. For now, the figure below shows the original function and the table approximation when \omega=1 and \phi=0:

log(x+y) and log(x) + T[i] shown in 3d and in 2d along the slice y=1

On the left, \log(x+y) and \log(x)+T[\cdot] are plotted in 3D, for x \geq y. The slice where y=1 is shown in the 2D plot on the right. Note the discontinuities introduced by the floor function in the approximation.
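
The whole scheme takes only a few lines of code. Below is a minimal sketch in Python, not the Sphinx 4 implementation; it assumes natural logs, \omega=1, \phi=0.5, and a table of 104 entries (indexes 0 through \Omega=103, a bound justified in the next section):

    import math

    OMEGA_FREQ = 1.0   # sampling frequency (omega)
    PHI = 0.5          # rounding threshold (phi)
    TABLE_SIZE = 104   # indexes 0..Omega, with Omega = 103

    # T[i] = log(1 + exp(-i / omega)), precomputed once
    TABLE = [math.log1p(math.exp(-i / OMEGA_FREQ)) for i in range(TABLE_SIZE)]

    def log_add(log_x, log_y):
        # Approximate log(x + y) given log(x) and log(y).
        if log_x < log_y:                 # enforce log(x) >= log(y)
            log_x, log_y = log_y, log_x
        i = int(OMEGA_FREQ * (log_x - log_y) + PHI)  # discretize the difference
        if i >= TABLE_SIZE:               # contribution too small to matter
            return log_x
        return log_x + TABLE[i]

For example, log_add(math.log(2.0), math.log(3.0)) returns roughly \log(5) \approx 1.609, and log_add(1000.0, 1000.0) returns about 1000.693 without ever leaving the log domain.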

Approximation Error

Approximation error is affected by the table length and the discretization function. These are parameterized by the quantities:

  • \Omega – table length
  • \phi – rounding threshold
  • \omega – “sampling frequency”

Since this blog post is stunningly long already, \omega won’t be analyzed in detail. Predictably, increasing the sampling frequency decreases the approximation error at the cost of storing more samples in the table. Since the largest error occurs at the start of the table, increasing \omega is the best way to lower the maximum error without changing the discretization function.

Table length and \Omega

Longer tables are theoretically better than shorter tables. However, long tables can hurt cache performance and, after a certain point, the table contribution is so small that ignoring it barely affects error. Limiting the size of the table is a smart speed/precision trade-off.

It is possible to provide an upper bound on the largest useful table length when the result \log(x+y) is stored in a limited-precision representation. Consider that the smallest positive value that can be represented in single-precision floating point is 2^{-149}. Be generous and allow all intermediate calculations to use arbitrary precision, rounding the result only when storing it in single precision as the final answer. If the table element used contributes less than half of the smallest single-precision-representable value, that element can never affect the stored result:

\frac{1}{2} \cdot 2^{-149} > \log(1+\exp(-\frac{\Omega+1}{\omega}))

And solving for \Omega gives:

\Omega > -\omega \log(\exp(2^{-150})-1)-1

Which is the surprisingly small \Omega=103 when \omega=1.
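
The bound is easy to check numerically. A sketch in double precision (math.expm1 keeps \exp(2^{-150})-1 from rounding to zero):

    import math

    def table_length_bound(omega=1.0):
        # Smallest integer Omega satisfying
        #   Omega > -omega * log(exp(2**-150) - 1) - 1
        return math.ceil(-omega * math.log(math.expm1(2.0 ** -150)) - 1.0)

    print(table_length_bound())  # 103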

Rounding to an index and \phi

The table index is proportional to \log(x)-\log(y). The parameter \phi determines the discretization of this quantity, so that it can be mapped to an integer and used as a table index. When \phi=0, the discretization operation is truncation. When \phi=0.5, the discretization operation is standard rounding.

If fast evaluation is the one and only goal, use truncation by setting \phi=0. Otherwise, consider the error due to the table approximation as a function of the difference d = \log(x)-\log(y):

|\log(1+\exp(-\frac{\lfloor \omega d + \phi \rfloor}{\omega})) - \log(1+\exp(-d))|\qquad 0\leq d \leq \Omega

Which, for \omega=1, looks like:

Two views of the table approximation error as a 3D plot

Two views of the table approximation error as d and \phi vary. Error is greatest at small indexes and is predictably zero at the table indexes, because these are the recorded samples from the underlying function. Note that, besides looking neat, the plot shows that some values of \phi are clearly better than others.

And looking at the error at the start of the table for particular values of \phi gives:

Table approximation error for phi=0 and phi=0.5

Table approximation error for \phi=0 and \phi=0.5. Note the approximation error is zero at the table indexes and that truncation introduces worse error than rounding.

One choice of \phi would be the value that minimizes the total table error:

\text{argmin}_{\phi} \int_0^{\Omega} |\log(1+\exp(-\frac{\lfloor \omega d + \phi \rfloor}{\omega})) - \log(1+\exp(-d))|\ \mathrm{d}d

This integral is difficult to evaluate and minimize, in part due to the floor function and the discontinuities it introduces. Frequency and table length both affect the optimal value. The figures below illustrate the behavior of this equation. In an appendix at the end of this post, near-optimal values of \phi are provided for different settings of \omega.
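
One way to estimate the optimum is to replace the integral with a dense Riemann sum and grid-search over \phi. A rough sketch follows; the step sizes and search range are arbitrary choices, and this is not necessarily how the appendix values were computed:

    import math

    def total_table_error(phi, omega=1.0, num_entries=100, steps=1000):
        # Riemann-sum approximation of the integral of the absolute
        # table error over d = log(x) - log(y).
        dd = 1.0 / (omega * steps)   # step in d; spans num_entries table indexes
        total = 0.0
        for k in range(num_entries * steps):
            d = k * dd
            i = math.floor(omega * d + phi)             # index actually used
            approx = math.log1p(math.exp(-i / omega))   # table value
            exact = math.log1p(math.exp(-d))            # true contribution
            total += abs(approx - exact) * dd
        return total

    # Coarse grid search over phi for omega = 1
    best = min((total_table_error(p / 1000.0), p / 1000.0)
               for p in range(450, 650, 5))
    print(best)  # roughly (0.169, 0.59): compare the omega = 1 appendix row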

Discretization error for different phi and omega

Discretization error for different phi and omega. The piecewise horizontal function is defined by the table approximation; its values are the sampled values from the continuous function. At low frequencies, the continuous function is curvy enough that error is minimized when \phi is set a tad above the value that would give standard rounding. Here, \phi=0.57 is very close to the optimal value. However, as the sampling frequency increases, the continuous function is well-approximated locally by straight lines. So the optimal value of \phi approaches \phi=0.5, which is standard rounding.

The estimated optimal value of \phi as \omega varies:

Estimated optimal phi as omega varies

Estimated optimal \phi as \omega varies. See the table in the appendix for the source data. Note that as \omega becomes large, traditional rounding gives lower errors.

Log Base Trick

Changing the logarithm base has the same effect on error as changing the sampling frequency \omega. If the value of the logarithm base is not important in this part of the application, save a multiplication by setting \omega=1 and choosing the log base that achieves the same approximation error.

An attempt at an intuitive, practical explanation is that increasing \omega maps the set of values \lfloor \omega \cdot (\log(x)-\log(y)) \rfloor to a larger set of integers. Changing the log base has the same effect, since:

\lfloor \log_b(x) - \log_b(y) \rfloor = \lfloor 1/\log(b) \cdot (\log(x) - \log(y)) \rfloor

That 1/\log(b) plays the role of \omega. Setting 1/\log(b) = \omega and solving for the base gives b = e^{1/\omega}, which is b \approx \{1.105, 1.010, 1.001\} for \omega = \{10, 100, 1000\}.
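
A quick numerical check of the equivalence, assuming the values start out as natural logs (the differences tested are arbitrary):

    import math

    omega = 10.0
    b = math.exp(1.0 / omega)                     # log base equivalent to omega
    for diff in (0.05, 0.333, 1.73):              # sample values of log(x) - log(y)
        scaled = math.floor(omega * diff)         # index via a multiplication
        rebased = math.floor(diff / math.log(b))  # index via a change of base
        assert scaled == rebased
    print("indexes agree for omega =", omega)

In practice the point is to store the logs in base b from the start, so the division by \log(b) never happens at run-time and the index is just \lfloor \log_b(x) - \log_b(y) + \phi \rfloor.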

Conclusion: Inspiration and Resources

I originally found this table-based approximation method in the Sphinx 4 source code. It is useful in several places, one of which is calculating the likelihood of a frame of audio data given a Gaussian mixture model (GMM). This calculation lies in what is effectively the innermost loop of the speech recognizer. It runs tens of thousands of times each second, so speeding it up is a worthwhile optimization.

The source code of LogMath.java and its javadoc document this beautiful little algorithm well. It would never have caught my attention without that careful documentation. From what I remember, the documentation describes a derivation, some log-base tricks, and a similar idea for bounding the table length. This post builds on those explanations, with a slightly better-framed derivation, an analysis of \phi that is completely new, and pretty pictures.

This post is also much, much longer. I kept finding new and interesting tidbits as this became my hobby for the month. I couldn't bring myself to leave some of the gems unindexed. And there are still interesting questions left to be answered. Some of these depend on the details of different applications, but there are also general questions like: is it ever worth making a two-dimensional table to approximate the calculation \log(x+y+z)?

Appendix: Table of Good Values for \phi

Good values of \phi as \omega varies, on the basis of minimizing error over the first 100 entries in the table:

\omega    \phi         lower-bound on error
1         0.588644     0.169006
2         0.54489      0.0861034
3         0.529998     0.057602
4         0.522519     0.043254
5         0.518022     0.0346227
6         0.515018     0.0288611
7         0.512875     0.0247426
8         0.511267     0.0216523
9         0.510017     0.0192478
10        0.509073*    0.0173243*
15        0.506008
20        0.504494
25        0.50357
30        0.502952
35        0.5025
40        0.502159
45        0.501895
50        0.501684
100       0.5009*      0.00173275*
200       0.500353
300       0.500245
400       0.50018
500       0.500143
600       0.500115
700       0.500081
800       0.500091
900       0.500019
1000      0.500062*    0.0000950386*

The numbers with asterisks were calculated over the first 1000 table entries. This gives a better estimate of the best value of \phi for a table of this length or longer. With more entries, the optimal value of \phi inches up, but not greatly. However, the error lower-bound is significantly affected when more samples are used, so only a few of the sufficiently-sampled lower-bounds were shown.
