Mathematical approximation
In mathematics, a smooth maximum of an indexed family $x_1, \ldots, x_n$ of numbers is a smooth approximation to the maximum function $\max(x_1, \ldots, x_n)$, meaning a parametric family of functions $m_\alpha(x_1, \ldots, x_n)$ such that for every $\alpha$, the function $m_\alpha$ is smooth, and the family converges to the maximum function, $m_\alpha \to \max$ as $\alpha \to \infty$. The concept of smooth minimum is similarly defined. In many cases, a single family approximates both: maximum as the parameter goes to positive infinity, minimum as the parameter goes to negative infinity; in symbols, $m_\alpha \to \max$ as $\alpha \to \infty$ and $m_\alpha \to \min$ as $\alpha \to -\infty$. The term can also be used loosely for a specific smooth function that behaves similarly to a maximum, without necessarily being part of a parametrized family.
Examples
Boltzmann operator
Figure: smoothmax of (−x, x) versus x for various parameter values; very smooth for $\alpha = 0.5$, sharper for $\alpha = 8$.
For large positive values of the parameter $\alpha > 0$, the following formulation is a smooth, differentiable approximation of the maximum function. For negative values of the parameter that are large in absolute value, it approximates the minimum.
$$\mathcal{S}_\alpha(x_1, \ldots, x_n) = \frac{\sum_{i=1}^{n} x_i e^{\alpha x_i}}{\sum_{i=1}^{n} e^{\alpha x_i}}$$
$\mathcal{S}_\alpha$ has the following properties:

- $\mathcal{S}_\alpha \to \max$ as $\alpha \to \infty$
- $\mathcal{S}_0$ is the arithmetic mean of its inputs
- $\mathcal{S}_\alpha \to \min$ as $\alpha \to -\infty$
The gradient of $\mathcal{S}_\alpha$ is closely related to softmax and is given by

$$\nabla_{x_i} \mathcal{S}_\alpha(x_1, \ldots, x_n) = \frac{e^{\alpha x_i}}{\sum_{j=1}^{n} e^{\alpha x_j}} \left[ 1 + \alpha \left( x_i - \mathcal{S}_\alpha(x_1, \ldots, x_n) \right) \right].$$
This smooth, softmax-weighted form makes the operator useful for optimization techniques that use gradient descent.
This operator is sometimes called the Boltzmann operator,[1] after the Boltzmann distribution.
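A minimal Python sketch of this operator and its gradient (the function names and the max-shift used for numerical stability are illustrative additions, not part of the cited formulation):

```python
import math

def boltzmann_operator(xs, alpha):
    """Smooth maximum S_alpha: an exp(alpha * x_i)-weighted average of the inputs.

    Shifting by max(xs) before exponentiating leaves the value unchanged
    but avoids overflow when alpha * x_i is large.
    """
    shift = max(xs)
    weights = [math.exp(alpha * (x - shift)) for x in xs]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, xs)) / total

def boltzmann_gradient(xs, alpha):
    """Partial derivatives of S_alpha: softmax weight times [1 + alpha * (x_i - S_alpha)]."""
    s = boltzmann_operator(xs, alpha)
    shift = max(xs)
    weights = [math.exp(alpha * (x - shift)) for x in xs]
    total = sum(weights)
    return [(w / total) * (1 + alpha * (x - s)) for w, x in zip(weights, xs)]

print(boltzmann_operator([1.0, 2.0, 3.0], 0.0))   # 2.0, the arithmetic mean
print(boltzmann_operator([1.0, 2.0, 3.0], 10.0))  # close to the maximum, 3.0
```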
LogSumExp
Another smooth maximum is LogSumExp:
$$\mathrm{LSE}_\alpha(x_1, \ldots, x_n) = \frac{1}{\alpha} \log \sum_{i=1}^{n} \exp(\alpha x_i)$$
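A small Python sketch of this formula (the function name and the standard max-shift stabilization are illustrative assumptions, not taken from the article):

```python
import math

def lse_smooth_max(xs, alpha):
    """LSE_alpha(x) = (1/alpha) * log(sum_i exp(alpha * x_i)).

    The max shift is factored out of the exponentials and added back
    outside the log, so the value is unchanged but overflow is avoided.
    """
    if alpha == 0:
        raise ValueError("alpha must be nonzero")
    shift = max(alpha * x for x in xs)
    return (shift + math.log(sum(math.exp(alpha * x - shift) for x in xs))) / alpha

# For alpha > 0 this overestimates the true maximum by at most log(n) / alpha.
print(lse_smooth_max([1.0, 2.0, 3.0], 10.0))  # slightly above 3.0
```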
This can also be normalized if the $x_i$ are all non-negative, yielding a function with domain $[0, \infty)^n$ and range $[0, \infty)$:
$$g(x_1, \ldots, x_n) = \log\left( \sum_{i=1}^{n} \exp(x_i) - (n - 1) \right)$$
The $(n - 1)$ term corrects for the fact that $\exp(0) = 1$: it cancels all but one of the zero exponentials, so that $g = \log 1 = 0$ when all $x_i$ are zero.
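A short Python sketch of the normalized variant, under the stated assumption that all inputs are non-negative (the function name is illustrative):

```python
import math

def normalized_soft_max(xs):
    """g(x) = log(sum_i exp(x_i) - (n - 1)) for non-negative inputs.

    Subtracting n - 1 cancels all but one of the exp(0) = 1 terms,
    so an all-zero input maps to log(1) = 0.
    """
    if any(x < 0 for x in xs):
        raise ValueError("inputs must be non-negative")
    return math.log(sum(math.exp(x) for x in xs) - (len(xs) - 1))

print(normalized_soft_max([0.0, 0.0, 0.0]))  # 0.0
print(normalized_soft_max([1.0, 2.0, 4.0]))  # a bit above the maximum, 4.0
```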
Mellowmax
The mellowmax operator[1] is defined as follows:
$$\mathrm{mm}_\alpha(x) = \frac{1}{\alpha} \log \frac{1}{n} \sum_{i=1}^{n} \exp(\alpha x_i)$$
It is a non-expansive operator. As $\alpha \to \infty$, it acts like a maximum. As $\alpha \to 0$, it acts like an arithmetic mean. As $\alpha \to -\infty$, it acts like a minimum. This operator can be viewed as a particular instantiation of the quasi-arithmetic mean. It can also be derived from information-theoretic principles as a way of regularizing policies with a cost function defined by KL divergence. The operator has previously been utilized in other areas, such as power engineering.[2]
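A minimal Python sketch of mellowmax (the function name and the max-shift stabilization are illustrative; treating $\alpha = 0$ as its arithmetic-mean limit is a design choice, since the formula itself is undefined there):

```python
import math

def mellowmax(xs, alpha):
    """mm_alpha(x) = (1/alpha) * log((1/n) * sum_i exp(alpha * x_i))."""
    n = len(xs)
    if alpha == 0:
        return sum(xs) / n  # the alpha -> 0 limit is the arithmetic mean
    shift = max(alpha * x for x in xs)
    log_sum = shift + math.log(sum(math.exp(alpha * x - shift) for x in xs))
    return (log_sum - math.log(n)) / alpha

print(mellowmax([1.0, 2.0, 3.0], 50.0))   # close to the maximum, 3.0
print(mellowmax([1.0, 2.0, 3.0], 0.0))    # 2.0, the arithmetic mean
print(mellowmax([1.0, 2.0, 3.0], -50.0))  # close to the minimum, 1.0
```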
p-Norm
Another smooth maximum is the p-norm:
$$\|(x_1, \ldots, x_n)\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{\frac{1}{p}}$$
which converges to $\|(x_1, \ldots, x_n)\|_\infty = \max_{1 \leq i \leq n} |x_i|$ as $p \to \infty$.
An advantage of the p-norm is that it is a norm. As such it is scale invariant (homogeneous): $\|(\lambda x_1, \ldots, \lambda x_n)\|_p = |\lambda| \cdot \|(x_1, \ldots, x_n)\|_p$, and it satisfies the triangle inequality.
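A brief Python sketch illustrating the p-norm and its behaviour for large p (the function name and sample values are illustrative; a direct computation like this can overflow for very large p or large inputs):

```python
def p_norm(xs, p):
    """||x||_p = (sum_i |x_i|**p) ** (1/p); tends to max_i |x_i| as p grows."""
    return sum(abs(x) ** p for x in xs) ** (1.0 / p)

xs = [1.0, -2.0, 3.0]
print(p_norm(xs, 2))   # Euclidean norm, about 3.742
print(p_norm(xs, 50))  # close to max |x_i| = 3.0
# Homogeneity: scaling every input by 2 scales the norm by exactly 2.
print(p_norm([2 * x for x in xs], 50) / p_norm(xs, 50))  # 2.0
```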
Smooth maximum unit
The following binary operator is called the Smooth Maximum Unit (SMU):[3]
$$\max_\varepsilon(a, b) = \frac{a + b + |a - b|_\varepsilon}{2} = \frac{a + b + \sqrt{(a - b)^2 + \varepsilon}}{2}$$
where $\varepsilon \geq 0$ is a parameter. As $\varepsilon \to 0$, $|\cdot|_\varepsilon \to |\cdot|$ and thus $\max_\varepsilon \to \max$.
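A minimal Python sketch of the SMU formula (the function name is illustrative):

```python
import math

def smooth_max_unit(a, b, eps):
    """SMU: (a + b + sqrt((a - b)**2 + eps)) / 2, a smoothed max(a, b) for eps >= 0."""
    return (a + b + math.sqrt((a - b) ** 2 + eps)) / 2

print(smooth_max_unit(1.0, 3.0, 0.0))   # 3.0, the exact maximum
print(smooth_max_unit(1.0, 3.0, 0.01))  # slightly above 3.0
print(smooth_max_unit(2.0, 2.0, 0.01))  # 2.05: the smoothing is largest near a == b
```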
References
https://www.johndcook.com/soft_maximum.pdf
M. Lange, D. Zühlke, O. Holz, and T. Villmann, "Applications of lp-norms and their smooth approximations for gradient based learning vector quantization," in Proc. ESANN, Apr. 2014, pp. 271–276. (https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2014-153.pdf)