
Explainable Artificial Intelligence (XAI) refers to a set of methods and practices aimed at making the behavior and decisions of artificial intelligence systems understandable to humans. As machine learning and deep learning models have grown in complexity, the need to demystify their inner workings has become increasingly important - especially in domains where decisions impact lives, such as healthcare, law, and public administration. XAI seeks to provide transparency, interpretability, and trust by enabling users to comprehend why a model made a certain prediction, how it weighs input data, and to what extent it can be held accountable for its outcomes.

Traditionally, many machine learning models - especially those involving deep neural networks - function as "black boxes," offering high predictive accuracy but little insight into how they arrive at specific decisions. Explainability challenges this paradigm by emphasizing models whose mechanisms can be logically traced, mathematically analyzed, and, ideally, intuitively understood by both developers and domain experts.

In this context, MicroPython offers a uniquely advantageous environment for fostering explainability. As a lightweight implementation of Python designed for microcontrollers, MicroPython encourages minimalism, clarity, and a hands-on approach to programming. When complex artificial intelligence architectures are rebuilt in MicroPython from the ground up - without relying on abstracted libraries or opaque function calls - the result is a codebase that closely mirrors the mathematical logic behind machine learning and neural network operations. This stripped-down setting compels the developer to engage directly with core principles such as matrix multiplication, activation functions, and gradient descent, making the implementation not just visible, but pedagogically powerful.

By removing the comfort of high-level libraries, MicroPython invites learners and researchers to reengage with the fundamentals. Each neuron, each layer, and each weight update must be explicitly defined and managed, thereby providing a unique opportunity to demystify how models process data, adjust parameters, and converge toward solutions. This low-level perspective is not only educational but also instrumental in aligning artificial intelligence development with the ideals of transparency and accountability central to XAI. Combining the discipline of MicroPython with the goals of XAI thus creates a framework that is as instructive as it is principled, offering a rare and valuable bridge between theoretical understanding and practical implementation. Therefore, this comprehensive codebook covers a well-chosen set of common statistical basics, as well as machine learning and deep learning algorithms.

1. Statistical Basics - I

We will start with some statistical basics: Mean, variance and standard deviation. As part of univariate statistics, they not only serve to describe individual variables, but are also important foundations for advanced statistical analyses.

1.1. Dataset

One variable of the trees dataset, provided by Atkinson, A. C. (1985): Plots, Transformations and Regression via Oxford University Press:

Code 1

1.2. Mean

The mean, also known as the arithmetic mean, is one of the most common measures of central tendency in statistics. It represents the average value of a dataset and provides a single value that summarizes the entire data distribution. To calculate the mean, you sum all values in your dataset and divide this total by the number of values:

Code 2
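A minimal MicroPython-compatible sketch of such a mean function (the name `mean` is an illustrative choice, not necessarily the one used in Code 2):

```python
def mean(values):
    # Arithmetic mean: sum of all values divided by their count.
    return sum(values) / len(values)
```

For example, `mean([8, 9, 13])` returns `10.0`.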

1.3. Variance

The sample variance is a measure of how spread out the values in a dataset are. It quantifies the average squared deviation from the mean, giving insight into the variability within the sample. Unlike population variance, it divides by n−1 to account for the degrees of freedom, making it an unbiased estimator when working with a sample:

Code 3
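A sketch of a sample variance function in the same minimalist style (the name `variance` is illustrative):

```python
def variance(values):
    # Sample variance: sum of squared deviations from the mean,
    # divided by n - 1 (degrees of freedom) for an unbiased estimate.
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)
```

For example, `variance([2, 4, 6])` returns `4.0`.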

1.4. Standard Deviation

The standard deviation is the square root of the variance and provides a measure of spread in the same units as the original data. It indicates how much the values in a dataset typically deviate from the mean, making it easier to interpret than variance in practical terms:

Code 4
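A corresponding sketch for the standard deviation, computed directly as the square root of the sample variance (function name again illustrative):

```python
def standard_deviation(values):
    # Square root of the sample variance, in the same units as the data.
    m = sum(values) / len(values)
    var = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return var ** 0.5
```

For example, `standard_deviation([2, 4, 6])` returns `2.0`.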

1.5. Application

These are application examples for mean, sample variance and standard deviation in MicroPython:

Code 5

2. Statistical Basics - II

After analyzing individual variables, the focus is now on the possible interactions between two variables. The statistical basis for this is called bivariate statistics. Common methods include covariance, correlation and simple linear regression.

2.1. Dataset

Two variables of the trees dataset, provided by Atkinson, A. C. (1985): Plots, Transformations and Regression via Oxford University Press:

Code 6

2.2. Covariance

The covariance measures the directional relationship between two variables. A positive covariance indicates that the variables tend to increase together, while a negative covariance suggests that as one increases, the other tends to decrease. It's a foundational concept in statistics for understanding how two variables vary together:

Code 7
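A minimal sketch of a sample covariance function (the name `covariance` and the paired-list signature are assumptions):

```python
def covariance(x, y):
    # Sample covariance: sum of products of paired deviations
    # from the respective means, divided by n - 1.
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
```

For example, `covariance([1, 2, 3], [2, 4, 6])` returns `2.0`.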

2.3. Correlation

The correlation quantifies the strength and direction of the linear relationship between two variables. It standardizes the covariance by dividing it by the product of the standard deviations, resulting in a value between -1 and 1. A correlation close to 1 or -1 indicates a strong relationship, while a value near 0 suggests little to no linear association:

Code 8
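A sketch of a Pearson correlation function; note that the n-1 factors of covariance and standard deviations cancel, so raw deviation sums suffice (names are illustrative):

```python
def correlation(x, y):
    # Pearson correlation: covariance standardized by the product
    # of the standard deviations, yielding a value between -1 and 1.
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sum((xi - mx) ** 2 for xi in x) ** 0.5
    sy = sum((yi - my) ** 2 for yi in y) ** 0.5
    return cov / (sx * sy)
```

A perfectly linear pair such as `correlation([1, 2, 3], [2, 4, 6])` yields a value of 1.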

2.4. Single Linear Regression

A single linear regression models the relationship between two variables by fitting a straight line to the data. It calculates the slope b and intercept a of the line y=a+bx, where b indicates how much y changes for each unit increase in x, and a is the predicted value of y when x=0:

Code 9
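A minimal least-squares sketch returning the intercept a and slope b (the return order `(a, b)` is an assumption):

```python
def linear_regression(x, y):
    # Least-squares estimates for y = a + b*x:
    # b = cov(x, y) / var(x), a = mean(y) - b * mean(x).
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b
```

For example, `linear_regression([1, 2, 3], [3, 5, 7])` returns `(1.0, 2.0)`.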

2.5. Predict Function

The predict function is required to determine the respective y values for the underlying x values via a and b:

Code 10
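Such a predict function can be sketched as a one-liner applying the fitted line to each x value (signature is an assumption):

```python
def predict(x, a, b):
    # Apply the fitted line y = a + b*x to every x value.
    return [a + b * xi for xi in x]
```

For example, `predict([1, 2, 3], 1.0, 2.0)` returns `[3.0, 5.0, 7.0]`.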

2.6. Residuals

Residuals represent the differences between the observed values and the predicted values from a linear regression model. They indicate how well the model fits the data: a residual close to 0 means a good fit, while larger residuals suggest that the model doesn't capture the data as accurately. The residuals can be used to assess the assumptions of linear regression and identify any outliers:

Code 11

2.7. Coefficient of Determination

The coefficient of determination measures the proportion of variance in the dependent variable that is explained by the independent variable in a regression model. It indicates the goodness of fit: a coefficient of determination close to 1 means that the model explains most of the variance, while a value near 0 suggests the model doesn't capture much of the variability:

Code 12
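A self-contained sketch computing the coefficient of determination from observed values and predictions (the name `r_squared` is illustrative):

```python
def r_squared(y, y_pred):
    # Coefficient of determination:
    # 1 - (residual sum of squares / total sum of squares).
    my = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot
```

Perfect predictions yield `1.0`; predicting the mean for every case yields `0.0`.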

2.8. Application

These are application examples for covariance, correlation, as well as the single linear regression with the corresponding predictions, residuals and the coefficient of determination in MicroPython:

Code 13

3. Machine Learning - I

Because dependent variables are generally not dependent on just one independent variable, it is advisable to broaden the perspective to include multivariate statistics, which can take several independent variables into account. Therefore, multiple linear regression is introduced as a first multivariate approach for regression tasks and in order to predict the outcome of the dependent variable.

3.1. Dataset

Three variables of the trees dataset, provided by Atkinson, A. C. (1985): Plots, Transformations and Regression via Oxford University Press:

Code 14

3.2. Matrix Inversion

Matrix inversion is essential in solving systems of linear equations, particularly in methods like multiple linear regression. The following code implements the Gaussian elimination method to invert a matrix, ensuring it is invertible by checking for non-zero pivots during the process:

Code 15
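A Gauss-Jordan elimination sketch along these lines, working on an augmented matrix [M | I] and checking for zero pivots (function name and the singularity threshold are assumptions):

```python
def matrix_inverse(m):
    # Gauss-Jordan elimination on the augmented matrix [M | I];
    # raises an error if a pivot is zero, i.e. M is singular.
    n = len(m)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        # Find a row with a usable (non-zero) pivot in this column.
        pivot_row = None
        for r in range(col, n):
            if abs(aug[r][col]) > 1e-12:
                pivot_row = r
                break
        if pivot_row is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        # Normalize the pivot row so the pivot becomes 1.
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        # Eliminate this column from all other rows.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    # The right half of the augmented matrix is now the inverse.
    return [row[n:] for row in aug]
```

For example, the inverse of `[[2.0, 0.0], [0.0, 4.0]]` is `[[0.5, 0.0], [0.0, 0.25]]`.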

3.3. Matrix Transposition

Matrix transposition involves flipping a matrix over its diagonal, converting rows into columns and vice versa. The resulting matrix is called the transpose of the original matrix. Transposition is commonly used in linear algebra, especially in operations like solving systems of equations or adjusting data representations:

Code 16

3.4. Matrix Multiplication

Matrix multiplication is a way of combining two matrices to create a new one. This operation is essential in many areas of linear algebra, including solving systems of linear equations and applying transformations. It is important for multiple linear regression because it allows you to calculate the coefficients of the regression model by multiplying the inverse of the design matrix with the target values:

Code 17
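A compact matrix multiplication sketch for nested lists, taking the dot product of each row of the first matrix with each column of the second (name is illustrative):

```python
def matmul(a, b):
    # (n x m) times (m x p) gives (n x p): each entry is the dot
    # product of a row of a with a column of b.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]
```

For example, `matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`.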

3.5. Multiple Linear Regression

With these mathematical basics, the multiple linear regression can be calculated as follows:

Code 18

3.6. Predict Function

A slightly modified predict function is required to determine the respective y values for the underlying x values:

Code 19

3.7. Residuals

Again, residuals represent the differences between the observed values and the predicted values:

Code 20

3.8. Coefficient of Determination

The coefficient of determination for a multiple linear regression model measures how well the model's predictions match the actual data. It indicates the proportion of the variance in the target variable that can be explained by the model. Its interpretation is therefore similar to the coefficient of determination of a single linear regression model and may vary between 0 and 1, where a value closer to 0 means the model doesn't explain much of the variance:

Code 21

3.9. Application

Finally, these are application examples for the multiple linear regression coefficients with corresponding predictions for one case, the residuals of the model as well as the coefficient of determination in MicroPython:

Code 22

4. Machine Learning - II

Another multivariate approach can be demonstrated via multiple logistic regression. This time, the dependent variable is nominally scaled and enables a distinction to be made between the classes 0 and 1. As a result, this multivariate statistics approach can be used for classification tasks.

4.1. Dataset

Three variables of the trees dataset, provided by Atkinson, A. C. (1985): Plots, Transformations and Regression via Oxford University Press. The dependent variable has been dichotomized, whereby a volume greater than 20 results in 1, else 0:

Code 23

4.2. Sigmoid Function

The sigmoid function in (multiple) logistic regression maps any input value to a range between 0 and 1, allowing us to interpret the result as a probability. It produces an S-shaped curve that is ideal for binary classification, for example with 0 = no and 1 = yes:

Code 24
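A minimal sigmoid sketch; `math.exp` is used here for brevity, although the exponential could also be coded manually in the spirit of this codebook:

```python
import math

def sigmoid(z):
    # Logistic function: maps any real input into the interval (0, 1).
    return 1 / (1 + math.exp(-z))
```

For example, `sigmoid(0)` returns `0.5`, the decision boundary between the two classes.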

4.3. Log Function

The log function approximates the natural logarithm using a numerical method based on the limit definition, useful when built-in log functions are unavailable in MicroPython. The natural logarithm (ln) is the inverse of the exponential function and tells us the power to which e ≈ 2.71828 must be raised to obtain a given number:

Code 25
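One way to realize such an approximation follows the limit definition ln(x) = lim n→∞ of n · (x^(1/n) − 1), with a large fixed n. This sketch assumes a port with double-precision floats; the choice of n is an assumption:

```python
def ln(x, n=1000000):
    # Limit definition of the natural logarithm:
    # ln(x) = lim (n -> infinity) of n * (x**(1/n) - 1).
    # A large fixed n yields a numerical approximation; accuracy
    # depends on the float precision of the MicroPython port.
    return n * (x ** (1.0 / n) - 1)
```

For example, `ln(2.71828...)` approximates 1, and `ln(1.0)` is exactly 0.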

4.4. Prediction of Probabilities

As a result of these mathematical basics, a function for the prediction of probabilities is required for processing the values of the previous sigmoid function:

Code 26

4.5. Gradient Descent Training

This function trains a logistic regression model using gradient descent. It iteratively updates the weights and biases to minimize the error between predicted probabilities (from the sigmoid function) and actual labels. By adjusting the weights in the direction that reduces the loss, the model gradually learns to classify input data:

Code 27
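A compact sketch of such a training loop with per-sample gradient descent updates (the function signature, learning-rate default and epoch count are assumptions):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train(x, y, learning_rate=0.1, epochs=1000):
    # x: list of feature vectors, y: list of 0/1 labels.
    weights = [0.0] * len(x[0])
    bias = 0.0
    for _ in range(epochs):
        for xi, yi in zip(x, y):
            # Forward pass: predicted probability for this sample.
            z = sum(w * v for w, v in zip(weights, xi)) + bias
            p = sigmoid(z)
            # For the cross-entropy loss, the gradient w.r.t. z is (p - y).
            error = p - yi
            # Step weights and bias against the gradient.
            weights = [w - learning_rate * error * v
                       for w, v in zip(weights, xi)]
            bias -= learning_rate * error
    return weights, bias
```

On a small separable example, the learned weights push the predicted probabilities toward the correct labels on either side of the decision boundary.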

4.6. Predict Function

Again, a slightly modified predict function is required to determine the respective y values for the underlying x values:

Code 28

4.7. Application

The application examples for the multiple logistic regression focus on weights and bias of the model and return logits and probabilities as values for classification. A classification example highlights the functionality of multiple logistic regression models:

Code 29

5. Machine Learning - III

It is possible that not all cases in a dataset are equivalent. Accordingly, similar cases can be clustered to enable detailed analyses of the corresponding clusters. Many different clustering algorithms are available; k-means clustering will be demonstrated since it is particularly illustrative and commonly used.

5.1. Dataset

Two variables and 15 cases of the original trees dataset, provided by Atkinson, A. C. (1985): Plots, Transformations and Regression via Oxford University Press. The other 15 cases are simulated trees, based upon another type of tree. Therefore, the dependent variable is dichotomized, indicating black cherry trees from the original dataset by 0 and simulated trees by 1:

Code 30

5.2. Euclidean Distance

The euclidean distance measures the straight-line distance between two points in a multi-dimensional space, calculated as the square root of the sum of the squared differences between corresponding coordinates. It’s commonly used in clustering and classification tasks to determine similarity between data points:

Code 31
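A minimal sketch of such a distance function for points given as coordinate lists (name is illustrative):

```python
def euclidean_distance(p, q):
    # Straight-line distance: square root of the summed
    # squared coordinate differences.
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
```

For example, `euclidean_distance([0, 0], [3, 4])` returns `5.0`.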

5.3. Centroids Function

The initializing centroids function sets the starting points for the cluster centers and influences the convergence of the algorithm and the quality of the final clusters, as it determines how the data is grouped during the iterative process:

Code 32

5.4. Assigning Clusters Function

The assigning clusters function groups data points into clusters based on their proximity to the centroids. For each point, it calculates the euclidean distance to each centroid and assigns the point to the closest centroid’s cluster, ensuring that each cluster contains the points nearest to its respective centroid:

Code 33
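A sketch of such an assignment step, returning one centroid index per point (the name `assign_clusters` and the label representation are assumptions):

```python
def assign_clusters(points, centroids):
    # For each point, compute the euclidean distance to every
    # centroid and assign the index of the nearest one.
    labels = []
    for p in points:
        distances = [sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5
                     for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels
```

For example, with centroids at `[0, 0]` and `[10, 10]`, the points `[[0, 0], [9, 9], [1, 0]]` are labeled `[0, 1, 0]`.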

5.5. Computing Centroids Function

The computing centroids function calculates the new centroids by finding the mean of all points within each cluster. For each cluster, it averages the values of each feature across all points, updating the centroid to represent the center of that cluster.

Code 34

5.6. Within Cluster Sum of Squares

The within cluster sum of squares is defined as the total squared distance between each point and its assigned cluster centroid. It measures the compactness of the clusters, with smaller values indicating tighter clusters. The code computes this by summing the squared differences for all points in each cluster, relative to the centroid of that cluster:

Code 35

5.7. k-Means Algorithm

The k-means algorithm groups data points into k clusters. It iteratively assigns points to the closest centroids, recalculates the centroids, and computes the within cluster sum of squares until the centroids no longer change or the maximum number of iterations is reached, returning the final within cluster sum of squares value to assess the clustering quality:

Code 36

5.8. k-Means Indicator

The k-means indicator highlights the change of the within cluster sum of squares values when the number of centroids is increased. A decreasing value indicates a better allocation of the cases to the centroids:

Code 37

5.9. Application

The application examples indicate the within cluster sum of squares for each number of clusters. In addition, it indicates the number of clusters within the dataset, which in this case is supposed to be 2 and assigns the labels accordingly. The position of the centroids is highlighted as well:

Code 38

6. Machine Learning - IV

A factor analysis is a statistical method that reduces a large number of variables into a smaller, more manageable set of underlying factors. It helps identify hidden patterns and relationships within data, making it easier to understand complex structures.

6.1. Dataset

These are 10 variables, based upon 5 variables each for the two personality dimensions extraversion and neuroticism, from the bfi dataset by Revelle, W., Wilt, J. and A. Rosenthal (2010): Individual Differences in Cognition: New Methods for examining the Personality-Cognition Link via Springer:

Code 39

6.2. Mean Center Function

Mean centering via a mean center function is often used in data preprocessing to make the dataset more suitable for machine learning algorithms by ensuring all features contribute equally to the model:

Code 40

6.3. Correlation Matrix

Again, the correlation matrix quantifies the strength and direction of the linear relationship between two variables. This code can be used to correlate several variables and summarize the corresponding values in one matrix:

Code 41

6.4. Power Iteration Function

The power iteration function is an algorithm used to compute the dominant eigenvalues and eigenvectors of a matrix. The process involves iteratively applying matrix-vector multiplication to a random initial vector and normalizing it to avoid overflow or underflow, which allows the vector to converge to the eigenvector corresponding to the largest eigenvalue:

Code 42
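A sketch of power iteration with a fixed start vector instead of a random one, for reproducibility; the eigenvalue is estimated via the Rayleigh quotient (function name, iteration count and start vector are assumptions, and the start vector must not be orthogonal to the dominant eigenvector):

```python
def power_iteration(matrix, iterations=100):
    # Repeated matrix-vector multiplication drives the vector toward
    # the eigenvector of the largest (dominant) eigenvalue.
    n = len(matrix)
    v = [1.0] * n  # fixed start vector; a random one works as well
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        # Normalize to unit length to avoid overflow or underflow.
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient v^T (A v) estimates the dominant eigenvalue.
    w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(vi * wi for vi, wi in zip(v, w))
    return eigenvalue, v
```

For the diagonal matrix `[[2, 0], [0, 1]]`, the iteration converges to the eigenvalue 2 with eigenvector (1, 0).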

6.5. Factor Loadings Function

The factor loadings function computes the factor loadings based on the correlation matrix, eigenvalues, and eigenvectors. In factor analysis, factor loadings represent the relationships between observed variables and the underlying latent factors.

Code 43

6.6. Application

This application example shows how to compute the correlation matrix, the eigenvalues as well as the corresponding factor loadings for identifying the underlying factors:

Code 44

7. Deep Learning - I

A neural network consists of neurons and layers that process data via activation functions. Weights and biases are necessary in order to activate a neuron and to reach out for other neurons in another layer. The first neural network will use pretrained weights and biases straight out of TensorFlow in order to identify and process non-linear patterns within datasets.

7.1. Dataset

These are five variables from the iris dataset by Fisher, R. (1936): The use of multiple measurements in taxonomic problems via John Wiley & Sons. The four independent variables are based upon length and width of the sepal leaf (x1 and x2) as well as the petal leaf (x3 and x4). All independent variables are standardized. Additionally, the dependent variable differs between versicolor (0) and virginica (1) as different species of iris flowers.

Code 45

7.2. Libraries

Normally, all functions in MicroPython can be coded manually. However, the math library is imported here to simplify the execution of the exponential function.

Code 46

7.3. Activation Functions

Neural networks are based upon neurons, and activation functions decide whether a neuron should be activated or not. This means that an activation function decides whether a neuron's input to the network is important for the prediction, using simple mathematical operations such as the Rectified Linear Unit (ReLU), the Leaky Rectified Linear Unit (Leaky ReLU), the Hyperbolic Tangent (Tanh) or the Logistic Function (Sigmoid).

Code 47
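The four activation functions named above can be sketched as follows (function names and the Leaky ReLU slope of 0.01 are common conventions, assumed here):

```python
import math

def relu(x):
    # ReLU: identity for positive inputs, zero otherwise.
    return x if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: a small slope alpha instead of zero for negatives.
    return x if x > 0 else alpha * x

def tanh(x):
    # Hyperbolic tangent: squashes inputs into (-1, 1).
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def sigmoid(x):
    # Sigmoid (logistic function): squashes inputs into (0, 1).
    return 1 / (1 + math.exp(-x))
```

For example, `relu(-2.0)` returns `0.0`, while `sigmoid(0.0)` returns `0.5`.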

7.4. Single Neuron

A single neuron therefore accesses one of the previously defined activation functions and can be defined as follows in MicroPython:

Code 48
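A minimal sketch of such a neuron, with the activation function passed in as a parameter (this parameterization is an assumption):

```python
def neuron(inputs, weights, bias, activation):
    # Weighted sum of the inputs plus bias, passed through the
    # chosen activation function.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)
```

For example, with an identity activation, `neuron([1.0, 2.0], [0.5, 0.5], 0.5, lambda z: z)` returns `2.0`.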

7.5. Data Formats and Processing

In order for the data to be adequately processed by a neural network, a series of data formats such as vectors, matrices and the architecture of neural networks via layers must be defined. These mathematical basics of a neural network can be defined as follows:

Code 49

7.6. Weights and Biases

The architecture of a neural network can be reconstructed in MicroPython with the weights and biases from an already pretrained deep learning model. They can be transferred from TensorFlow (a deep learning library for Python) to MicroPython. The following structure indicates four independent variables (rows) for two neurons (columns) in the input layer with the corresponding weights w1. In addition, the input layer has two corresponding biases b1. The first hidden layer then consists of three neurons with w2 and b2, the second hidden layer consists of two neurons with w3 and b3, and the output layer is a single neuron with w4 and b4. As a result, this neural network consists of a total of eight neurons.

Code 50

7.7. Neural Network Architecture

According to the transferred weights and biases, the architecture of the neural network can be defined in MicroPython as follows. This specifies the number of neurons within each layer and the activation functions for activating the neurons.

Code 51

7.8. Confusion Matrix

A confusion matrix, also known as an error matrix, is a table that visualizes the performance of a classification model by comparing its predictions against the actual results. It's a two-dimensional matrix that displays the counts of true positives, true negatives, false positives, and false negatives, providing a detailed view of where a model's predictions are correct and where it's making errors.

Code 52
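A binary confusion matrix can be sketched as a simple counting loop; the layout [[TN, FP], [FN, TP]] chosen here is one common convention and an assumption:

```python
def confusion_matrix(actual, predicted):
    # Counts for a binary classifier, laid out as [[TN, FP], [FN, TP]].
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 1 and p == 0:
            fn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return [[tn, fp], [fn, tp]]
```

For example, `confusion_matrix([1, 0, 1, 0, 1], [1, 0, 0, 1, 1])` returns `[[1, 1], [1, 2]]`.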

7.9. Application

The performance of the pretrained neural network can be viewed via the following MicroPython code:

Code 53

8. Deep Learning - II

The second neural network will adjust weights and biases automatically. As a result, this neural network can identify and process non-linear patterns within datasets on its own.

8.1. Dataset

Again, the five variables from the iris dataset by Fisher, R. (1936): The use of multiple measurements in taxonomic problems via John Wiley & Sons will be used. The four independent variables are based upon length and width of the sepal leaf (x1 and x2) as well as the petal leaf (x3 and x4). All independent variables are standardized. Additionally, the dependent variable differs between versicolor (0) and virginica (1) as different species of iris flowers.

Code 54

8.2. Libraries

The random library and math library are imported to simplify the execution of some functions required for self-learning neural networks.

Code 55

8.3. Activation Functions and Derivatives

Self-learning neural networks not only require activation functions, but also their derivatives. The derivative of a function represents its instantaneous rate of change at a specific point. This allows the neural network to be trained.

Code 56

8.4. Function for Random Initialization

Since the neural network is supposed to learn the weights and biases by itself, the layers and neurons of the neural network will be initialized with some random values.

Code 57

8.5. Forward and Backward Data Processing

In neural networks, forward propagation is the process of passing input data through the network's layers to generate a prediction. Backward propagation, on the other hand, is the mechanism used to train the network by calculating the error between the prediction and the actual output, and then adjusting the network's weights to minimize that error. This is essential for the learning ability of a neural network.

Code 58

8.6. Loss Function

Furthermore, a loss function quantifies the difference between a deep learning model's prediction and the actual outcome, essentially acting as a measure of the model's error. Cross-entropy, a specific type of loss function, is commonly used for classification problems, especially when the model outputs probabilities.

Code 59

8.7. Neural Network Architecture

This time the architecture of the neural network consists of four independent variables which will be forwarded to three neurons in the input layer and one neuron in the output layer. This is a very simple neural network that consists of four neurons in two layers with corresponding weights (w1 and w2) and biases (b1 and b2).

Code 60

8.8. Specification of Learning Behavior

Finally, the number of epochs and the learning rate need to be specified in MicroPython. In neural networks, an epoch represents one complete pass of the entire training dataset through the model. Learning rate determines how much the model's weights are adjusted during each update step in the training process. Both are crucial hyperparameters that influence training and model performance.

Code 61

8.9. Predict Function

The outcome of the neural network can be predicted with the following code in MicroPython:

Code 62

8.10. Confusion Matrix

As in the pretrained neural network before, a confusion matrix can be used to evaluate the performance of the neural network.

Code 63

8.11. Application

Finally, the performance of the neural network can be inspected and validated with new data:

Code 64