Derivative#
The crucial concepts of calculus are the derivative and the integral.
The linear function
is nice and simple, and its graph is a straight line. Using differentiation and differentials, one can locally reduce any smooth function to a linear one!
In a similar manner, any smooth function is approximately linear in a small neighborhood of the tangent point:
A more rigorous definition:
Strict definition
The derivative of \(f \colon (x-\delta, x+\delta) \to \mathbb R\), \(\delta > 0\), at point \(x\) is
Existence of derivative \(f'(x)\) is equivalent to differentiability of \(f\) at point \(x\):
Note that (62) implies this equality, but the converse is false.
Differential#
The function \(df(x, h) = f'(x)h\) is called the differential of \(f\) at point \(x\). Note that it is a function of two variables \(x\) and \(h\), and the dependence on \(h\) is linear.
Important
Due to historical reasons, the increment \(h\) is often denoted as \(dx\); then the formula for the differential is
The differential is the main linear part of the increment \(\Delta f = f(x + h) - f(x)\).
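As a quick sanity check (a sketch, not part of the original text), the difference quotient \(\frac{f(x+h) - f(x)}{h}\) should approach \(f'(x)\) as \(h \to 0\), and the differential \(f'(x)h\) should capture the increment \(\Delta f\) up to a term of smaller order. For \(f(x) = \sin x\) at \(x = 1\):
import numpy as np

# Illustration (not from the original notebook): the difference quotient tends
# to f'(x) = cos(x), and the increment minus the differential is o(h).
x = 1.0
for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    increment = np.sin(x + h) - np.sin(x)
    print(f"h={h:.0e}  quotient={increment / h:.6f}  cos(x)={np.cos(x):.6f}  "
          f"increment - df = {increment - np.cos(x) * h:.2e}")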
Rules of differentiation#
Derivative:
\(f'(x) \equiv 0\) if \(f(x)\equiv \mathrm{const}\)
\((\alpha f(x) + \beta g(x))' = \alpha f'(x) + \beta g'(x)\)
\((f(x)g(x))' = f'(x) g(x) + f(x) g'(x)\)
\(\big(\frac{f(x)}{g(x)}\big)' = \frac{f'(x) g(x) - f(x) g'(x)}{g^2(x)}\) if \(g(x) \ne 0\)
\(((f \circ g)(x))' = f'(g(x)) g'(x)\) (chain rule)
Differential:
\(d(\alpha f + \beta g) = \alpha df + \beta dg\)
\(d(fg) = fdg + g df\)
\(d\big(\frac fg\big) = \frac{gdf - f dg}{g^2}\)
\(d(f \circ g)(x) = df(g(x)) = f'(g(x)) \cdot dg(x)\) (chain rule)
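These rules can also be checked symbolically. The sketch below uses sympy (an assumption; it is not used elsewhere on this page) to verify the product rule and the chain rule for one concrete pair of functions.
import sympy as sp

x = sp.symbols('x')
f, g = sp.sin(x), sp.exp(x**2)

# product rule: (f*g)' - (f'*g + f*g') should simplify to 0
print(sp.simplify(sp.diff(f * g, x) - (sp.diff(f, x) * g + f * sp.diff(g, x))))

# chain rule: (f(g(x)))' - f'(g(x)) * g'(x) should simplify to 0
print(sp.simplify(sp.diff(f.subs(x, g), x) - sp.cos(g) * sp.diff(g, x)))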
Derivatives of higher orders#
If the function \(f'(x)\) is also differentiable, then its derivative is called the second derivative of \(f\): \(f''(x) = \frac d{dx}(f'(x))\). By induction, the \(n\)-th derivative is defined as
To find the second differential of \(f\), just differentiate \(df(x) = f'(x) dx\) with respect to \(x\), assuming that \(dx\) is a constant:
A function \(f \colon [a, b] \to \mathbb R\) is called continuously differentiable \(n\) times (denoted as \(f \in C^n[a, b]\)) if its \(n\)-th derivative is continuous: \(f^{(n)} \in C[a, b]\). If \(f \in C^{n+1}[a, b]\) then the Taylor formula holds:
where the remainder term \(r_n\) is
If \(\lim\limits_{n\to\infty} r_n = 0\), the function \(f\) can be expanded into Taylor series:
If \(a=0\), the Taylor series is called the Maclaurin series.
Example
If \(f(x) = e^x\) then \(f^{(k)}(x) = e^x\) for all \(k\in\mathbb N\). Also,
Hence,
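As a small numerical illustration (not part of the original notebook), the partial sums \(\sum_{k=0}^{n} \frac{x^k}{k!}\) of this Maclaurin series quickly approach \(e^x\):
import numpy as np
from math import factorial

# Partial sums of the Maclaurin series of e^x versus np.exp (illustrative sketch)
x = 2.0
for n in [1, 2, 5, 10, 15]:
    partial_sum = sum(x**k / factorial(k) for k in range(n + 1))
    print(f"n={n:2d}  partial sum={partial_sum:.8f}  error={abs(partial_sum - np.exp(x)):.2e}")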
Applications of derivatives#
If \(f'(x) > 0\) (\(f'(x) < 0\)) for all \(x\in (a, b)\), then \(f\) is increasing (decreasing) on \((a, b)\).
If \(f'(x) = 0\) and \(f''(x) > 0\) (\(f''(x) < 0\)), then \(x\) is a local minimum (maximum) of \(f\).
If \(f''(x) > 0\) (\(f''(x) < 0\)) for all \(x\in (a, b)\), then \(f\) is strictly convex (concave) on \((a, b)\).
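These criteria can be applied symbolically; the sketch below (using sympy, which is an assumption and not part of the original code) finds the critical points of \(f(x) = x^3 - 3x\) and classifies them with the second derivative test.
import sympy as sp

x = sp.symbols('x', real=True)
f = x**3 - 3*x

# critical points: solutions of f'(x) = 3x^2 - 3 = 0
critical_points = sp.solve(sp.diff(f, x), x)
for p in critical_points:
    second = sp.diff(f, x, 2).subs(x, p)  # f''(x) = 6x
    kind = "local minimum" if second > 0 else "local maximum"
    print(f"x = {p}: f''(x) = {second}, {kind}")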
Numeric optimization methods#
In practice, the optimization problem \(f(x) \to \min\limits_x\) often cannot be solved analytically. However, if we can calculate \(f'(x)\) at any point \(x\), the sign of the derivative shows the direction in which the function \(f\) decreases.
Let \(\eta >0\). A point of minimum of \(f\) can be found by one of the following iterative algorithms:
Gradient descent
\[ x_{n+1} = x_n - \eta f'(x_n) \]
Newton’s method
\[ x_{n+1} = x_n - \eta \frac{f'(x_n)}{f''(x_n)} \]
Secant method
\[ x_{n+1} = x_n - \eta \frac{x_n - x_{n-1}}{f'(x_n) - f'(x_{n-1})}f'(x_n) \]
Note that Newton’s method requires access to the second derivative of \(f\), whereas the other two methods need only the first. In the secant method, the value of \(f''(x_n)\) is approximated by the finite difference \(\frac{f'(x_n) - f'(x_{n-1})}{x_n - x_{n-1}}\).
import matplotlib.pyplot as plt
import numpy as np
%config InlineBackend.figure_format = 'svg'
class Oracle:
    def __init__(self, coef):
        self._coef = coef

    def func(self, x):
        return x**4 + self._coef[0] + self._coef[1]*x + self._coef[2]*x**2 + self._coef[3] * x**3

    def grad(self, x):
        """
        Computes the gradient (here just the first derivative) at point x.
        """
        return 4*x**3 + self._coef[1] + 2*self._coef[2]*x + 3*self._coef[3] * x**2

    def hess(self, x):
        """
        Computes the Hessian (here just the second derivative) at point x.
        """
        return 12*x**2 + 2*self._coef[2] + 6*self._coef[3] * x
o4 = Oracle([ -2, 9, 7, -7])
xs = np.linspace(-2, 6, 501)
plt.plot(xs, o4.func(xs), lw=2, c='r', label="f(x)")
plt.title("Polynomial to minimize")
plt.legend()
plt.grid(ls=":");
Implement three optimization methods:
def gd(oracle, x_0, eta=2e-3, eps=1e-6, max_iter=1000):
    x_star = x_0
    for i in range(max_iter):
        gradient = oracle.grad(x_star)
        if abs(gradient) < eps:
            return "success", x_star, i
        x_star -= eta * gradient
    return "Iterations exceeded", x_star, max_iter
def newton(oracle, x_0, eta=0.5, eps=1e-6, max_iter=1000):
    x_star = x_0
    for i in range(max_iter):
        gradient = oracle.grad(x_star)
        if abs(gradient) < eps:
            return "success", x_star, i
        x_star -= eta * gradient / oracle.hess(x_star)
    return "Iterations exceeded", x_star, max_iter
def secant(oracle, x_0, eta=0.5, eps=1e-6, max_iter=1000):
    x_star, x_prev = x_0, x_0 + 2*eps
    for i in range(max_iter):
        gradient = oracle.grad(x_star)
        if abs(gradient) < eps:
            return "success", x_star, i
        # f''(x_n) is approximated by a finite difference of the first derivatives
        step = gradient / (gradient - oracle.grad(x_prev)) * (x_star - x_prev)
        x_star, x_prev = x_star - eta * step, x_star
    return "Iterations exceeded", x_star, max_iter
Try several starting points:
def print_result(result, method_name, oracle):
    print(method_name, result[0])
    print("Number of iterations:", result[2])
    print("Point of minimum: {:.3f}".format(result[1]))
    print("Function value: {:.3f}".format(oracle.func(result[1])))
print_result(gd(o4, -5), "GD", o4)
print_result(newton(o4, -5), "Newton", o4)
print_result(secant(o4, -5), "secant", o4)
GD success
Number of iterations: 255
Point of minimum: -0.393
Function value: -4.007
Newton success
Number of iterations: 32
Point of minimum: -0.393
Function value: -4.007
secant success
Number of iterations: 34
Point of minimum: -0.393
Function value: -4.007
print_result(gd(o4, 0), "GD", o4)
print_result(newton(o4, 0), "Newton", o4)
print_result(secant(o4, 0), "secant", o4)
GD success
Number of iterations: 251
Point of minimum: -0.393
Function value: -4.007
Newton success
Number of iterations: 22
Point of minimum: -0.393
Function value: -4.007
secant success
Number of iterations: 22
Point of minimum: -0.393
Function value: -4.007
print_result(gd(o4, 2), "GD", o4)
print_result(newton(o4, 2), "Newton", o4)
print_result(secant(o4, 2), "secant", o4)
GD success
Number of iterations: 183
Point of minimum: 4.319
Function value: -48.549
Newton success
Number of iterations: 24
Point of minimum: 1.324
Function value: 9.013
secant success
Number of iterations: 24
Point of minimum: 1.324
Function value: 9.013
print_result(gd(o4, 6), "GD", o4)
print_result(newton(o4, 6), "Newton", o4)
print_result(secant(o4, 6), "secant", o4)
GD success
Number of iterations: 147
Point of minimum: 4.319
Function value: -48.549
Newton success
Number of iterations: 29
Point of minimum: 4.319
Function value: -48.549
secant success
Number of iterations: 30
Point of minimum: 4.319
Function value: -48.549
If the learning rate is too large, gradient descent can skip the global minimum:
print_result(gd(o4, 6, 0.03), "GD", o4)
GD success
Number of iterations: 7
Point of minimum: -0.393
Function value: -4.007
Or even fail to converge:
print_result(gd(o4, 6, 0.04, max_iter=10000), "GD", o4)
GD Iterations exceeded
Number of iterations: 10000
Point of minimum: 3.730
Function value: -40.735
Numerical solution with scipy:
from scipy.optimize import minimize
minimize(o4.func, -5, jac=o4.grad)
message: Optimization terminated successfully.
success: True
status: 0
fun: -4.007115450131773
x: [-3.934e-01]
nit: 10
jac: [-1.966e-06]
hess_inv: [[ 3.083e-02]]
nfev: 11
njev: 11
minimize(o4.func, 6, jac=o4.grad)
message: Optimization terminated successfully.
success: True
status: 0
fun: -48.54902987113144
x: [ 4.319e+00]
nit: 7
jac: [ 4.809e-08]
hess_inv: [[ 1.771e-02]]
nfev: 8
njev: 8
Exercises#
Find the derivative of \(f(x) = \tanh x = \frac{\sinh x}{\cosh x} = \frac {e^x - e^{-x}}{e^x + e^{-x}}\).
Show that \(\sigma'(x) = \sigma(x) (1 - \sigma(x))\) where
\[ \sigma(x) = \frac 1{1 + e^{-x}} \]
is the sigmoid function.
Find \(\max\limits_{x\in\mathbb R}\sigma'(x)\).
Give an example of a function \(f\) which is differentiable at point \(x\) but (62) does not hold.
Find the first and the second differential of \(f(x) = \sin x\) at point \(x = \frac \pi 3\).
Find Maclaurin series of \(f(x) = \frac 1{1 - x}\) and \(g(x) = \frac 1{(1-x)^2}\).
Find the global maximum of
\[ f(x) = \prod\limits_{i=1}^n \exp\Big(-\frac{(x - a_i)^2}{2 \sigma_i^2}\Big), \quad \sigma_i > 0. \]
What if \(\sigma_1 = \ldots = \sigma_n = \sigma > 0\)? Does this function have a global minimum?