diff options
Diffstat (limited to 'en_GB/Introduction to Machine Learning/introduction_to_machine_learning.md')
-rw-r--r-- | en_GB/Introduction to Machine Learning/introduction_to_machine_learning.md | 206 |
1 files changed, 205 insertions, 1 deletions
diff --git a/en_GB/Introduction to Machine Learning/introduction_to_machine_learning.md b/en_GB/Introduction to Machine Learning/introduction_to_machine_learning.md index 0e17258..a765486 100644 --- a/en_GB/Introduction to Machine Learning/introduction_to_machine_learning.md +++ b/en_GB/Introduction to Machine Learning/introduction_to_machine_learning.md @@ -64,6 +64,193 @@ $$ \text{maximize}\ P(\textbf{w} | D) \Leftrightarrow \text{minimize}\ -\text{lo $$ \textbf{w}_{\text{MAP}} = \text{argmax}_{\textbf{w}}(P(\textbf{w} | D)) = \text{argmin}_{\textbf{w}}(\frac{1}{2} \sum\limits_{n=1}^{N} (y(x_n, \textbf{w}) - t_n)^2 + \frac{\alpha}{2} \textbf{w}^{T} \textbf{w}) $$ $$ \alpha := \text{Hyperparameter, denoting initial uncertainty} $$ +### Bayesian predictive distribution + +$$ \int p(t | x, \textbf{w}) p(\textbf{w} | \textbf{x}, \textbf{t}) d \textbf{w} = \mathcal{N}(t | m(x), s^{2}(x)) $$ +$$ s^{2}(x) := \text{Estimated variance} $$ + +### Incremental bayesian learning + +1. $P(\textbf{w} | D_1) = \frac{P(D_1, \textbf{w}) P(\textbf{w})}{P(D_1)}$ +2. $P(\textbf{w} | D_2) = \frac{P(D_2, \textbf{w}) P(\textbf{w} | D_1)}{P(D_2)}$ + +... + +### Linear model + +$$ y(\textbf{x}, \textbf{w}) = \sum\limits_{m=0}^{M} w_m {\Phi}_{m}(\textbf{x}) = \textbf{w}^T {\Phi}(\textbf{x}) $$ +$$ {\Phi}(\textbf{x}) := \text{Design matrix} $$ + +### Design matrix + +$$ {\Phi}(X) = +\begin{bmatrix} +{\Phi}_{1}(\textbf{x}_1) & \cdots & {\Phi}_{M}(\textbf{x}_1) \\ +\vdots & \ddots & \vdots \\ +{\Phi}_{1}(\textbf{x}_N) & \cdots & {\Phi}_{M}(\textbf{x}_N) +\end{bmatrix} +$$ +$$ X := \text{Input matrix (columns are inputs and rows of each column are input coordinates)} $$ + +### Quadratic-error-function for linear model + +$$ E(\textbf{w}) = \frac{1}{2} \sum\limits_{n=1}^{N} (t_n - \textbf{w}^T {\Phi}(\textbf{x}_n))^2 $$ + +### Parameter optimization for linear model + +$$ \textbf{w}_{ML} = {\Phi}^{\#} \textbf{t} $$ +$$ {\Phi}^{\#} = ({\Phi}^T \Phi)^{-1} {\Phi}^T $$ +$$ {\Phi}^{\#} := \text{Moore-Penrose pseudo inverse} $$ + +- Get the parameters $\textbf{w}$ for the given target vector $\textbf{t}$ + +### Maximum a-posteriori approach for linear model + +$$ E(\textbf{w}) = \frac{1}{2} \sum\limits_{n=1}^{N} (t_n - \textbf{w}^T {\Phi}(\textbf{x}_n))^2 + \frac{\lambda}{2}\left\|\textbf{w}\right\|^2 $$ +$$ \textbf{w}_{\text{MAP}} = (\lambda \textbf{I} + {\Phi}^T \Phi)^{-1} {\Phi}^T \textbf{t} $$ + +### Bayesian predictive distribution + +$$ p(t | x, \textbf{x}, \textbf{t}) = \int p(t | x, \textbf{w}) p(\textbf{w} | \textbf{x}, \textbf{t}) d \textbf{w} = \mathcal{N}(t | m(x), s^2(x)) $$ +$$ m(x) = \beta {\Phi}(x)^T \textbf{S} \sum\limits_{n=1}^{N} {\Phi}(x_n) t_n $$ +$$ s^2(x) = {\beta}^{-1} + {\Phi}(x)^T \textbf{S} {\Phi}(x) $$ +$$ \textbf{S}^{-1} = \alpha \textbf{I} + \beta \sum\limits_{n=1}^{N} {\Phi}(x_n) {\Phi}(x_n)^T $$ +$$ \textbf{S} := \text{Covariance matrix} $$ + +### Gradient descent + +- Update parameters $\textbf{w}$ iteratively + +$$ \textbf{w}_{k+1} = w_k - \eta \frac{dE(\textbf{w})}{d \textbf{w}} $$ +$$ \eta := \text{Learning rate} $$ + +### Weighted least squares error function + +$$ E(\textbf{w}) = \frac{1}{2} \sum\limits_{n=1}^{N} d_n (t_n - \textbf{w}^t \textbf{x}_n)^2 $$ +$$ d_n := \text{Weight for term $n$} $$ + +### Weighted least squares parameter optimization + +$$ \textbf{w}_{\text{ML}} = (\Phi D \Phi)^{-1} {\Phi}^T D T = (X^T D X)^{-1} X^T D T $$ +$$ D := \text{Diagonal weight matrix with $D_{nn} = d_n$} $$ + +### Locally weighted regression (unified model) + +$$ t(\textbf{x}) = \sum\limits_{e=1}^{E} {\Phi}_e (\textbf{x}, {\theta}_e) (\textbf{w}_e^T \textbf{x} + b_e) $$ +$$ {\theta}_e = ({\mu}_e, {\Sigma}_e) := \text{Gaussian parameters} $$ +$$ {\Phi}(\textbf{x}, \theta) = \mathcal{N}(\textbf{x} | \mu, \Sigma) $$ + +### Classification (K=2) + +$$ \textbf{w}^T \textbf{x} + w_0 = 0\ \text{(Hyperplane)} $$ +$$ \textbf{w}^T \textbf{x} + w_0 < 0 \Rightarrow \text{Class 0} $$ +$$ \textbf{w}^T \textbf{x} + w_0 > 0 \Rightarrow \text{Class 1} $$ +$$ \textbf{w}^T \textbf{x} + w_0 = 0 \Rightarrow \text{On boundary} $$ + +### Fisher discriminant + +- Midpoints $\textbf{m}_1 = \frac{1}{N_1} \sum\limits_{n \in C_1} \textbf{x}_n$, $\textbf{m}_2 = \frac{1}{N_2} \sum\limits_{n \in C_2} \textbf{x}_n$ +- Intra-class variance: $s_k^2 = \sum\limits_{n \in C_k} (\textbf{w}^T \textbf{x}_n - \textbf{w}^T \textbf{m}_k)^2 = \sum\limits_{n \in C_k} (\textbf{w}^T \textbf{x}_n - m'_k)^2$ +- Optimization: $\textbf{w} = \textbf{S}_{W}^{-1}(\textbf{m}_2 - \textbf{m}_1)$ +- Within-class variance: $\textbf{S}_{W} = \sum\limits_{n \in C_1} (\textbf{x}_n - \textbf{m}_1) (\textbf{x}_n - \textbf{m}_1)^T + \sum\limits_{n \in C_2} (\textbf{x}_n - \textbf{m}_2) (\textbf{x}_n - \textbf{m}_2)^T$ +- Classification: $\textbf{w}^T (\textbf{x} - \textbf{m}) > 0 \Rightarrow C_1$, $\textbf{w}^T (\textbf{x} - \textbf{m}) < 0 \Rightarrow C_2$ + +### Bayesian classification + +$$ p(C_k | \textbf{x}) = \frac{p(\textbf{x} | C_k) p(C_k)}{\sum\limits_{j} p(\textbf{x} | C_j) p(C_j)} = \frac{\text{exp}(a_k)}{\sum\limits_{j} \text{exp}(a_j)} $$ +$$ a_j = \text{ln} p(\textbf{x} | C_j) p(C_j) $$ +$$ j \in \{1, ..., K\} $$ + +### Bayesian classification with gaussian deviation + +$$ p(\textbf{x} | C_k) = \frac{1}{2 {\pi}^{\frac{D}{2}}} \frac{1}{{| \Sigma |}^{\frac{1}{2}}} \text{exp}(- \frac{1}{2} (\textbf{x} - {\mu}_k)^T {\Sigma}^{-1} (\textbf{x} - {\mu}_k)) $$ +$$ p(C_1 | \textbf{x}) = \sigma (\text{ln} \frac{p(\textbf{x} | C_1) p(C_1)}{p(\textbf{x} | C_2) p(C_2)}) = \sigma (\textbf{w}^T \textbf{x} + w_0) $$ +$$ \textbf{w} = {\Sigma}^{-1} ({\mu}_1 - {\mu}_2) $$ +$$ w_0 = - \frac{1}{2} {\mu}_1^T {\Sigma}^{-1} {\mu}_1 + {\mu}_2^T {\Sigma}^{-1} {\mu}_2 + \text{ln} \frac{p(C_1)}{p(C_2)} $$ + +### Generalization algorithm + +1. Init $h$ with $< \emptyset , \emptyset , ..., \emptyset >$ +2. For all positive examples $x$ +3. For all constraints $a_i$ +4. If $a_i$ is fulfilled by $x$ +5. OK +6. Else +7. Generalize $a_i$ + +### Specialization algorithm + +1. Init $h$ with $< ?, ?, ..., ? >$ +2. For all negative examples $x$ +3. For all constraints $a_i$ +4. If $a_i$ is fulfilled by $x$ +5. OK +6. Else +7. Replace $a_i$ with most special + +### Candidate elimination + +- Initialize $G$ to the set of maximally general hypotheses in $H$ +- Initialize $H$ to the set of maximally specific hypotheses in $H$ + +1. For each training instance $x$: +2. If $x$ is a positive example +2.1. Remove from $G$ any hypothesis inconsistent with $x$ +2.2. For each hypothesis $s$ in $S$ that is not consistent with $x$ +2.2.1. Remove $s$ from $S$ +2.2.2. Add to $S$ all minimal generalizations $h$ of $s$ such that $h$ is consistent with $x$, and some member of $G$ is more general than $h$ +2.2.3. Remove from $S$ any hypothesis that is more general than another hypothesis in $S$ +3. If $x$ is a negative example +3.1. Remove from $S$ any hypothesis inconsistent with $x$ +3.2. For each hypothesis $g$ in $G$ that is not consistent with with $x$ +3.2.1. Remove $g$ from $G$ +3.2.2. Add to $G$ all minimal specializations $h$ of $g$ such that $h$ is consistent with $x$, and some member of $S$ is more specific than $h$ +3.2.3. Remove from $G$ any hypothesis that is less general than another hypothesis in $G$ + +### K-Nearest neighbours + +$$ f(x_q) = \frac{1}{k} \sum\limits_{i=0}^{k} f(\hat{x}_i) $$ + +### Weighted K-Nearest neighbours + +$$ f(x_q) = \frac{\sum\limits_{i=1}^{k} w_i f(x_i)}{\sum\limits_{i=1}^{k} w_i} $$ +$$ w_i = d(x_i, x_q)^{-1} $$ + +### Gaussian kernel regression + +$$ f(x_q) = \sum\limits_{i} y_i \frac{K_{\sigma}(x_i - x_q)}{\sum\limits_{j} K_{\sigma} (x_j - x_q)} $$ + +### Vector quantization + +1. Choose number of prototypes $K$ +2. Initialize position of all prototypes $K$ +3. Iterate over data: +3.1. Choose random $\textbf{x}_n$ +3.2. Find nearest prototype $\textbf{w}_c$ +3.3. Move the prototype to the direction of the data point $\textbf{w}_c^{\text{new}} = \textbf{w}_c^{\text{old}} + \eta (\textbf{w}_c^{\text{old}} - \textbf{x}_n)$ + +### K-Means algorithm + +1. Number of prototypes $K$ +2. Initialize prototypes $\textbf{w}_k$ +2.1. Minimize $J$ with respect to $r_{nk}$ while keeping $\textbf{w}_k$ fixed +2.2. Minimize $J$ with respect to $\textbf{w}_k$ while keeping $r_{nk}$ fixed + +$$ J = \sum\limits_{n=1}^{N} \sum\limits_{k=1}^{K} r_{nk} \left\|\textbf{x}_n - \textbf{w}_k\right\|^2 $$ + +### Gaussian mixture model + +- Number of Gaussians $K$ +- Initialize priors ${\pi}_k$, centers $\textbf{w}_k$, covariances ${\Sigma}_k$ (by e.g. K-means) + +1. For all data points $\textbf{x}_n$ +2. E-Step: Update assignment probabilities +$$P_k(\textbf{x}_n) = \frac{{\pi}_k \mathcal{N}(\textbf{x}_n | \textbf{w}_k, {\Sigma}_k)}{\sum\limits_{i=1}^K {\pi}_k \mathcal{N}(\textbf{x}_n | \textbf{w}_i, {\Sigma}_i)}$$ +3. M-Step: Update all $k$ prior probabilites, cluster centers and covariances +$${\pi}_k = \frac{1}{N} \sum\limits_{n=1}^{N} P_k(\textbf{x}_n)$$ +$$ \textbf{w}_k = \frac{1}{N {\pi}_k} \sum\limits_{n=1}^{N} P_k(\textbf{x}_n) \textbf{x}_n $$ +$$ {\Sigma}_k = \frac{1}{N {\pi}_k} \sum\limits_{n=1}^{N} P_k(\textbf{x}_n) (\textbf{x}_n - \textbf{w}_k) (\textbf{x}_n - \textbf{w}_k)^T $$ + ## Definitions ### Likelihood @@ -72,4 +259,21 @@ Function, describing the joint probability of the data $\textbf{x}$ as function ### Bayesian-approach -Probabilistic model for the parameters, not the actual data.
\ No newline at end of file +Probabilistic model for the parameters, not the actual data. + +### Classification approaches + +1. Find discriminant function $f(\textbf{x}) : x \rightarrow \{0,1\}$ +2. Bayesian approach +2.1 Model class-conditional likelihood $p(x | C_k)$ +2.2 Model an a-priori probability distribution $p(C_k)$ +2.3 Calculate the posterior $p(C_k | x)$ +3. Direct posterior modeling $p(C_k | x)$ + +### Concept + +Trying to identify properties to which to identify the classes. + +### Hypothesis + +Function to identify if input is in class or not.
\ No newline at end of file |