Definition:

  • Features: W_i is the word at position i
  • Predict the label Y conditioned on the feature variables F_1, …, F_n
  • Assume features are conditionally independent given the label
Model: P(Y, F_1, …, F_n) = P(Y) ∏_i P(F_i | Y)
Prediction: y* = argmax_y P(y) ∏_i P(f_i | y)
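A minimal Python sketch of this prediction rule (the names predict, priors, and cond_probs are illustrative assumptions, not from the notes); summing log probabilities is a standard way to avoid numerical underflow when multiplying many small terms:

```python
import math

def predict(features, priors, cond_probs):
    """features: observed values f_1, ..., f_n
    priors: dict mapping label y -> P(y)
    cond_probs: dict mapping label y -> {feature value f: P(f | y)}"""
    best_label, best_score = None, float("-inf")
    for y, prior in priors.items():
        # log P(y) + sum_i log P(f_i | y), i.e. the log of the product above
        score = math.log(prior) + sum(math.log(cond_probs[y][f]) for f in features)
        if score > best_score:
            best_label, best_score = y, score
    return best_label
```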
Parameters:
  • For each word w and class y there is a probability P(W = w | Y = y)
  • ex: Spam Email Filter:
    • MLE for Naive Bayes Spam Classifier:
    • Find a single parameter for each word as P_ML(w | y) = count(w, y) / Σ_w′ count(w′, y)
    • Beware of overfitting: problems with relative-frequency parameters
      • Unlikely to see occurrences of every word in the training data.
      • Likely to see occurrences of a word for only one class in the training data.
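To make the problem concrete, here is a small Python sketch of relative-frequency estimation on made-up data (the word lists and function name are illustrative):

```python
from collections import Counter

# Relative-frequency (MLE) parameters for the spam filter, on toy data.
spam_words = ["free", "money", "free", "offer"]
ham_words = ["meeting", "tomorrow", "money"]

def mle_params(words):
    # P_ML(w | y) = count(w, y) / total words in class y
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

p_spam = mle_params(spam_words)
p_ham = mle_params(ham_words)
print(p_spam["free"])          # 0.5
print(p_ham.get("free", 0.0))  # 0.0 -> "free" never appears in ham, so any
# email containing "free" gets P(ham) = 0 regardless of its other words
```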

Parameter estimation:

  • P(x | θ) means the probability of event x occurring, given the parameter θ.
With maximum likelihood:
  • Estimating the distribution of a random variable
  • Empirically: use training data (learning!)
  • E.g.: red and blue
    • For a simple example of drawing a red or blue bean, the parameter θ is the probability of drawing red (and 1 − θ of drawing blue)
    • For each outcome x, look at the empirical rate of that value: P_ML(x) = count(x) / total samples
    • Maximum Likelihood Estimation
      • The likelihood L(θ) is a function that assigns a value to different possible parameter values based on how well they explain the observed data.
      • Higher values of L(θ) indicate that the parameter value θ is more likely to have generated the observed data.
  • General case: a sequence of n coin flips
    • Flips are independent and identically distributed (i.i.d.)
    • D = {x_1, …, x_n} is the sequence of observed data
    • Hypothesis space: binomial distributions with parameter θ = P(heads)
    • Learning: finding the θ which is optimal
    • MLE: choose θ̂ that maximizes the likelihood, θ̂ = argmax_θ P(D | θ)
    • ex: 2 heads and 1 tail gives P(D | θ) = θ · θ · (1 − θ) = θ²(1 − θ)
    • for n observations with count(H) heads: θ̂_ML = count(H) / n (so here θ̂ = 2/3; derivation below)
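As a worked check of the two-heads-one-tail example, the standard derivation (sketched in LaTeX) maximizes the likelihood by setting its derivative to zero:

```latex
P(D \mid \theta) = \theta \cdot \theta \cdot (1-\theta) = \theta^{2}(1-\theta)

\frac{d}{d\theta}\,\theta^{2}(1-\theta) = 2\theta - 3\theta^{2}
  = \theta(2 - 3\theta) = 0
\quad\Rightarrow\quad
\hat{\theta} = \frac{2}{3} = \frac{\mathrm{count}(H)}{n}
```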

Smoothing:

Laplace Smoothing:
  • Laplace’s estimate: P_LAP(x) = (count(x) + 1) / (N + |X|), where N is the number of samples and |X| the number of possible outcomes
    • Pretend you saw every outcome once more than you actually did
    • Can be derived as MAP estimation with Dirichlet priors
  • Laplace’s extended estimate: P_LAP,k(x) = (count(x) + k) / (N + k|X|)
    • Pretend you saw every outcome k times more than you actually did
    • where k is the strength of the prior
  • Laplace for conditionals:
    • Smooth each conditional independently: P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|)
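Below is a minimal Python sketch of the add-k formula above; the function name laplace_conditional and the toy vocabulary are illustrative, not from the notes:

```python
from collections import Counter

# Laplace (add-k) smoothing for a conditional distribution,
# following P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|).
def laplace_conditional(words_in_class, vocab, k=1.0):
    """words_in_class: word tokens observed for one class y.
    vocab: the set of all possible words X (unseen words get mass too)."""
    counts = Counter(words_in_class)
    denom = len(words_in_class) + k * len(vocab)
    return {w: (counts[w] + k) / denom for w in vocab}

vocab = {"free", "money", "meeting", "tomorrow", "offer"}
p = laplace_conditional(["meeting", "tomorrow", "money"], vocab, k=1.0)
print(p["free"])  # 0.125 -> unseen word, but no longer zero probability
```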

Naive Bayes classifier:
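Putting the pieces above together, a minimal end-to-end sketch of a Naive Bayes spam classifier (the toy training data and all names are illustrative assumptions): it fits Laplace-smoothed parameters, then predicts with log-space scores.

```python
import math
from collections import Counter

# Toy training data: (label, word tokens) pairs.
train = [
    ("spam", ["free", "money", "free", "offer"]),
    ("ham", ["meeting", "tomorrow", "money"]),
]
k = 1.0
vocab = {w for _, words in train for w in words}

# Priors P(y) from label frequencies
priors = {y: c / len(train) for y, c in Counter(y for y, _ in train).items()}

# Laplace-smoothed conditionals P(w | y)
cond = {}
for y in priors:
    words = [w for label, ws in train if label == y for w in ws]
    counts = Counter(words)
    denom = len(words) + k * len(vocab)
    cond[y] = {w: (counts[w] + k) / denom for w in vocab}

def classify(words):
    # y* = argmax_y  log P(y) + sum_i log P(w_i | y)
    scores = {
        y: math.log(priors[y]) + sum(math.log(cond[y][w]) for w in words if w in vocab)
        for y in priors
    }
    return max(scores, key=scores.get)

print(classify(["free", "offer"]))        # -> spam
print(classify(["meeting", "tomorrow"]))  # -> ham
```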