The minimum amount of measure theory needed to understand the probability theory behind machine learning.

These notes are based on (Capinski & Kopp, 2013), (Rosenthal, 2006) and (Qian, 2016).

**Definition (\(\sigma\)-algebra).** Let \(\Omega\) be a set. Then a \(\sigma\)-algebra \(\mathcal F\) is a nonempty collection of subsets of \(\Omega\) such that

- \(\Omega \in \mathcal F\).
- If \(A\) is in \(\mathcal F\), then so is the complement of \(A\).
- If \(A_1, A_2, \dots\) is a sequence of elements of \(\mathcal F\), then the union \(\bigcup_{n = 1}^\infty A_n\) is in \(\mathcal F\).

Call \((\Omega, \mathcal F)\) a measurable space. \(\square\)
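
As an illustration, the three axioms can be checked mechanically when \(\Omega\) is finite. The sketch below is not from the cited references; the helper names `powerset` and `is_sigma_algebra` are made up for this example, and it relies on the fact that, for a finite collection, closure under pairwise unions already gives closure under countable unions.

```python
from itertools import chain, combinations


def powerset(omega):
    """Return all subsets of a finite set as a set of frozensets."""
    items = list(omega)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(s) for s in subsets}


def is_sigma_algebra(omega, collection):
    """Check the sigma-algebra axioms for a collection of subsets of a finite omega."""
    omega = frozenset(omega)
    collection = {frozenset(a) for a in collection}
    # Axiom 1: omega itself belongs to the collection.
    if omega not in collection:
        return False
    # Axiom 2: closed under complements.
    if any(omega - a not in collection for a in collection):
        return False
    # Axiom 3: closed under unions (pairwise is enough for a finite collection).
    return all(a | b in collection for a in collection for b in collection)


omega = {1, 2, 3}
generated = [set(), {1}, {2, 3}, omega]   # sigma-algebra generated by {1}
not_closed = [set(), {1}, {2}, omega]     # missing the complement {2, 3}

print(is_sigma_algebra(omega, powerset(omega)))
print(is_sigma_algebra(omega, generated))
print(is_sigma_algebra(omega, not_closed))
```

The power set and the collection generated by \(\{1\}\) pass the check, while the last collection fails because it does not contain the complement of \(\{1\}\).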

**Definition (Measure).**
Let \((\Omega, \mathcal F)\) be a measurable space.
Let \(\mu: \mathcal F \to \bar{\mathbb R}\) be a mapping, where \(\bar{\mathbb R}\) denotes the set of extended real numbers.
Then \(\mu\) is called a measure on \(\mathcal F\) if and only if it has the following properties:

- For every \(F \in \mathcal F\), \(\mu(F) \geq 0\).
- For every sequence of pairwise disjoint sets \(S_n \in \mathcal F\): \begin{align} \mu\left(\bigcup_{n = 1}^\infty S_n \right) = \sum_{n = 1}^\infty \mu(S_n) \end{align} (that is, \(\mu\) is countably additive).
- \(\mu(\emptyset) = 0\). \(\square\)
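
A minimal sketch of a measure on a finite set, assuming it is built from nonnegative point masses (the name `make_measure` and the particular weights are hypothetical, chosen only for illustration). On a finite \(\Omega\) with \(\mathcal F\) the power set, countable additivity reduces to additivity over finitely many disjoint sets, which is what the checks below exercise.

```python
from fractions import Fraction


def make_measure(weights):
    """Return mu with mu(A) = sum of the point masses weights[w] for w in A."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("point masses must be nonnegative")

    def mu(subset):
        # Exact rational arithmetic avoids floating-point rounding in the checks.
        return sum((weights[w] for w in subset), Fraction(0))

    return mu


weights = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
mu = make_measure(weights)

A, B = {"a"}, {"b", "c"}  # two disjoint measurable sets

print(mu(set()) == 0)               # the empty set has measure zero
print(mu(A | B) == mu(A) + mu(B))   # additivity on disjoint sets
print(mu(A) >= 0 and mu(B) >= 0)    # nonnegativity
```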

**Definition (Probability measure).**
Let \((\Omega, \mathcal F)\) be a measurable space.
A measure \(P\) on this space is called a probability measure if \(P(\Omega) = 1\).

Call \((\Omega, \mathcal F, P)\) a probability triple. \(\square\)
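
As a small worked example (an illustration, not taken from the cited references), consider a single toss of a fair coin. Take \(\Omega = \{H, T\}\), let \(\mathcal F = \{\emptyset, \{H\}, \{T\}, \Omega\}\) be the power set of \(\Omega\), and define \begin{align} P(\emptyset) = 0, \quad P(\{H\}) = P(\{T\}) = \tfrac{1}{2}, \quad P(\Omega) = 1. \end{align} The only nontrivial instance of additivity is \begin{align} P(\{H\} \cup \{T\}) = P(\{H\}) + P(\{T\}) = \tfrac{1}{2} + \tfrac{1}{2} = 1 = P(\Omega), \end{align} so \((\Omega, \mathcal F, P)\) is a probability triple.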

**Definition (Measurable function).**
Let \((\Omega, \mathcal F)\) be a measurable space.
Let \((\mathcal X, \mathcal E)\) be another measurable space.
Let \(f: \Omega \to \mathcal X\) be a function.
Define \(f^{-1}(E) := \{\omega: \omega \in \Omega, f(\omega) \in E\}\) for \(E \in \mathcal E\).
The function \(f\) is said to be \(\mathcal F\)-measurable if \(f^{-1}(E) \in \mathcal F\) for all \(E \in \mathcal E\).
\(\square\)
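
The preimage condition can be tested directly on finite spaces. The following is a sketch under that assumption; `preimage`, `is_measurable`, `coarse`, and `parity` are hypothetical names introduced here. It shows that whether a function is measurable depends on how fine \(\mathcal F\) is, not only on the function itself.

```python
def preimage(f, omega, target_set):
    """Return f^{-1}(E) = {w in omega : f(w) in E} as a frozenset."""
    return frozenset(w for w in omega if f(w) in target_set)


def is_measurable(f, omega, sigma_f, sigma_e):
    """Check that the preimage of every set in sigma_e lies in sigma_f."""
    sigma_f = {frozenset(a) for a in sigma_f}
    return all(preimage(f, omega, e) in sigma_f for e in sigma_e)


omega = {1, 2, 3, 4}
# F can only tell the block {1, 2} apart from the block {3, 4}.
sigma_f = [set(), {1, 2}, {3, 4}, omega]
# E is the power set of the target space {0, 1}.
sigma_e = [set(), {0}, {1}, {0, 1}]


def coarse(w):
    """Constant on each block of F, hence measurable."""
    return 0 if w <= 2 else 1


def parity(w):
    """Splits the blocks of F: the preimage of {0} is {2, 4}, which is not in F."""
    return w % 2


print(is_measurable(coarse, omega, sigma_f, sigma_e))
print(is_measurable(parity, omega, sigma_f, sigma_e))
```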

**Definition (Random variable).**
Let \((\Omega, \mathcal F, P)\) be a probability triple.
Let \((\mathcal X, \mathcal E)\) be a measurable space.
Then a function \(X: \Omega \to \mathcal X\) is called a random variable if it is \(\mathcal F\)-measurable.
\(\square\)

**Definition (Probability distribution).**
Given a random variable \(X\) on a probability triple \((\Omega, \mathcal F, P)\) and the output space \((\mathcal X, \mathcal E)\), the probability distribution of \(X\) is \(P \circ X^{-1}\).
We write \(P_X := P \circ X^{-1}\).

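Note that \(P_X\) is a probability measure on \((\mathcal X, \mathcal E)\): it inherits countable additivity from \(P\) because preimages preserve disjoint unions, and \(P_X(\mathcal X) = P(X^{-1}(\mathcal X)) = P(\Omega) = 1\).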

We also call \(P_X\) the *law of \(X\)* and denote it by \(\mathcal L(X)\).
\(\square\)
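
To make \(P_X = P \circ X^{-1}\) concrete, here is a sketch for a finite sample space: two fair dice, with \(X\) the sum of the faces. The example and the names `pushforward` and `law_of` are assumptions made for illustration; every subset of \(\Omega\) is taken to be measurable, so \(P\) is determined by its point masses.

```python
from collections import defaultdict
from fractions import Fraction

# Sample space: ordered pairs of faces, each with probability 1/36.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = {w: Fraction(1, 36) for w in omega}


def X(w):
    """Random variable: the sum of the two faces."""
    return w[0] + w[1]


def pushforward(P, X):
    """Point masses of P_X: P_X({x}) = P(X^{-1}({x}))."""
    law = defaultdict(Fraction)  # Fraction() is zero
    for w, p in P.items():
        law[X(w)] += p
    return dict(law)


P_X = pushforward(P, X)


def law_of(event):
    """P_X(E) for a set E of possible values of X."""
    return sum((P_X.get(x, Fraction(0)) for x in event), Fraction(0))


print(P_X[7])                   # probability that the sum is 7
print(law_of({2, 12}))          # probability of the two extreme sums
print(sum(P_X.values()) == 1)   # P_X is again a probability measure
```

The last line checks the normalization \(P_X(\mathcal X) = 1\) on this example.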

**Definition (Integration).**

**Definition (Expectation).**

**Definition (Product measures).**

**Theorem (Radon-Nikodym).**

**Definition (Probability density).**

**Definition (Conditional expectation).**

**Definition (Conditional probability).**

**Theorem (Bayes’ rule).**

**Theorem (Sum rule).**

**Theorem (Product rule).**

**References**

- Capinski, M., & Kopp, P. E. (2013).
*Measure, integral and probability*. Springer Science & Business Media.

- Rosenthal, J. S. (2006).
*A first look at rigorous probability theory*. World Scientific.

- Qian, Z. (2016).
*Lecture notes on the course “B8.1 Martingales through Measure Theory.”* Mathematical Institute, University of Oxford. https://courses.maths.ox.ac.uk/node/124
