Motivation & Overview#

  • Modern neuroimaging technologies (EEG, fNIRS, MRI) generate massive and complex data.

  • Machine Learning (ML) methods have surged in neuroimaging research, going beyond classical univariate statistics.

  • ML models can now distinguish neurological/psychiatric patients from controls and predict conditions such as Alzheimer’s disease, schizophrenia, and autism.

  • Importance: pattern recognition, biomarker discovery, clinical decision support.

Neuroimaging Modalities#

  • EEG (Electroencephalography).
    Provides excellent temporal resolution (milliseconds) but limited spatial precision. It is highly sensitive to artifacts such as eye blinks, muscle noise, and electromagnetic interference. Common preprocessing steps include frequency filtering (low-pass/high-pass) and artifact removal methods (e.g., ICA). ML approaches on EEG have successfully classified cognitive traits.

    Example Study:
    Mikheev et al. (2024) investigated EEG patterns during arithmetic, logical, and verbal tasks, applying ML models (logistic regression, Riemann projections, LightGBM with handcrafted spectral power features, and explainability via SHAP) to distinguish individuals with a mathematics vs. humanities background, achieving balanced accuracies between 0.84 and 0.89.
    Read the article (Scientific Reports)

    Python libraries:

    • MNE-Python – EEG/MEG data analysis and preprocessing

  • MRI / fMRI.
    Structural MRI provides high-spatial-resolution anatomical imaging; fMRI captures blood-oxygen-level-dependent (BOLD) signals, which lag neural activity on the order of seconds. fMRI data are large (millions of voxels) and susceptible to noise from subject motion, physiological fluctuations (e.g. breathing, heartbeat), and scanner drift. The choice of preprocessing pipeline can substantially alter results.

    Example Study:
    Luppi et al. (2024) systematically evaluated 768 fMRI data-processing pipelines for resting-state functional connectomics and found that most pipelines produced inconsistent or misleading network reconstructions. A subset demonstrated robust performance across datasets and evaluation criteria.
    Read the article (Nature Communications)

    Python libraries:

    • Nilearn – machine learning for neuroimaging data

    • NiBabel – neuroimaging file handling

    • fMRIPrep – standardized preprocessing pipeline

    • Nipype – workflow orchestration

  • fNIRS (Functional Near-Infrared Spectroscopy).
    Measures cortical hemodynamics via optical signals. Advantages include portability and robustness to head motion (suitable for children), while limitations involve shallow penetration depth and sensitivity to physiological noise (pulse, respiration).

    Python libraries:

    • MNE-NIRS – fNIRS integration with MNE

Each modality generates distinct data types—multi-channel time series (EEG), 3D voxel volumes (fMRI), or optical signals (fNIRS)—and requires tailored preprocessing strategies to address modality-specific noise sources and artifacts.
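To make the contrast concrete, a minimal sketch of how each modality's data looks in Python (the file names are hypothetical placeholders):

```python
import mne
import nibabel as nib

# EEG: a multi-channel time series (n_channels x n_samples)
raw = mne.io.read_raw_fif("sample_raw.fif", preload=True)  # hypothetical file
print(raw.get_data().shape)   # e.g. (64, 250000): 64 channels over time

# MRI: a 3D voxel volume (4D for fMRI: x, y, z, time)
img = nib.load("anat.nii.gz")                              # hypothetical file
print(img.shape)              # e.g. (91, 109, 91): a voxel grid
```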

Machine Learning in Neuroimaging#

  • Supervised learning: Classification (patient vs. control), regression (predicting scores).

  • Unsupervised learning: Discover hidden clusters, connectivity subtypes.

  • Contrast with classical stats: GLM tests voxels independently; ML captures multivariate patterns.

Common ML Approaches:

  • Feature engineering + SVM/Random Forest.

  • Deep Learning (CNNs, autoencoders) → segmentation, disease prediction.

  • Validation: Proper CV, avoiding overfitting, ensuring generalization.
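A minimal sketch of the feature-engineering + classifier route, using synthetic data in place of real extracted features (e.g. connectivity values or spectral powers):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))   # 100 subjects x 500 engineered features
y = rng.integers(0, 2, size=100)  # binary labels: patient (1) vs. control (0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
print(scores.mean())  # ~0.5 here: the labels are random, so chance level is expected
```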

Reminder of the Classical Pipeline: the ML task#

Let \(X\) be a set of data samples and \(Y\) a set of targets. \(y : X \rightarrow Y\) is an unknown target function.

Input:

  • \(\{x_1, \dots, x_\ell\} \subset X\) - training sample;

  • \(y_i = y(x_i), ~i = 1, \dots, \ell\) - targets.

Output:

  • \(a: X \rightarrow Y\) - a decision function close to \(y\) on the whole set \(X\).

How objects are represented. Feature descriptions#

\(f_j\) - features of objects.

Types of features:

  • Binary feature \(f_j\):

    • gender, headache, weakness, nausea, etc.

  • Categorical feature \(f_j\):

    • name of the medicine

  • Ordinal feature \(f_j\):

    • severity of the condition, jaundice, etc.

  • Quantitative feature \(f_j\):

    • age, pulse, blood pressure, hemoglobin content in the blood, dose of the drug, etc.

Vector \(\big(f_1(x), f_2(x), \ldots, f_n(x)\big)\) is a feature description of the object \(x \in X\).

The feature data form the matrix:

\begin{equation*} F = \begin{pmatrix} f_1(x_1) & \dots & f_n(x_1) \\ \vdots & \ddots & \vdots \\ f_1(x_\ell) & \dots & f_n(x_\ell) \end{pmatrix} \end{equation*}

Training a classification model#

Train sample: \(X^\ell = \big(x_i,~y_i\big)_{i=1}^{\ell}, \quad x_i \in \mathbb{R}^n, \quad y_i \in \{-1,~+1\}\)

  • The classification model is linear: \begin{equation} a(x, \theta) = \text{sign} \big(\sum_{j=1}^{n} \theta_j f_j(x)\big), \quad \theta \in \mathbb{R}^n \end{equation}

  • The loss function is the binary loss or an approximation of it: \begin{equation} \mathscr{L}(a,~y) = [ay < 0] = \big[x^\top \theta \cdot y < 0\big] \le \tilde{\mathscr{L}}\big(x^\top \theta \cdot y\big) \end{equation}
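A worked numpy sketch of this linear classifier, with a hinge-style surrogate as the upper bound on the binary loss (toy numbers; \(F\) is the feature matrix defined above):

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.normal(size=(6, 3))           # feature matrix: l = 6 objects, n = 3 features
theta = np.array([0.5, -1.0, 0.2])    # parameter vector theta in R^n
y = np.array([1, -1, 1, 1, -1, 1])    # labels in {-1, +1}

margins = y * (F @ theta)             # x^T theta * y for each object

binary_loss = (margins < 0).mean()               # [a y < 0]: misclassification rate
hinge_loss = np.maximum(0, 1 - margins).mean()   # convex upper bound on the binary loss
print(binary_loss, hinge_loss)
```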

Analysis of classification errors#

The task of classification into two classes, \(y_i\in \{-1,~+1\}\).

Classification algorithm \(a(x_i) \in \{-1,~+1\}\).

By applying the algorithm \(a(x)\) to objects \(x\), we can get \(4\) possible situations:

Positive / Negative - which answer was given by the classifier \(a(x)\). True / False - whether the classifier gave the correct answer or made a mistake.

  • TP (True Positive): \(a(x) = +1\) and \(y = +1\);

  • FP (False Positive): \(a(x) = +1\), but \(y = -1\);

  • TN (True Negative): \(a(x) = -1\) and \(y = -1\);

  • FN (False Negative): \(a(x) = -1\), but \(y = +1\).

Number of correct classifications (the more, the better): \begin{equation} \text{Accuracy} = \frac{1}{\ell}\sum_{i=1}^{\ell}\big[a(x_i) = y_i\big] = ~\frac{\text{TP} + \text{TN}}{\text{FP} + \text{FN} + \text{TP} + \text{TN}} \end{equation}

Disadvantage: accuracy takes into account neither the class balance (imbalance) nor the cost of an error on objects of different classes.

For example: classification of patients vs. healthy controls.

  • Suppose you have 100 subjects: 95 healthy and 5 patients with a rare neurological disorder.

  • A classifier that always predicts “healthy” will yield an accuracy of 95%.

  • At first glance, 95% seems like high performance, but the model never identified a single patient — which was the key task.
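The same effect in code, with scikit-learn's metrics and the hypothetical 95/5 cohort:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = [0] * 95 + [1] * 5   # 95 healthy (0), 5 patients (1)
y_pred = [0] * 100            # a classifier that always predicts "healthy"

print(accuracy_score(y_true, y_pred))           # 0.95: looks impressive
print(balanced_accuracy_score(y_true, y_pred))  # 0.50: chance level, all patients missed
```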

Example Tasks#

  • Diagnosis/Prognosis: Predict Alzheimer’s, schizophrenia, etc. from scans.

  • Cognitive decoding: Identify stimulus/mental state from EEG/fMRI.

  • Biomarker discovery: Locate predictive regions or networks.

Data Challenges & Preprocessing#

High-dimensional, low-sample data → overfitting risk#

The problem of underfitting and overfitting#

  • Underfitting: the model is too simple, the number of parameters \(n\) is insufficient.

  • Overfitting: the model is too complex, there is an excessive number of parameters \(n\).

What causes overfitting?

  • excessive complexity of the parameter space: extra degrees of freedom in the model \(g(x, \theta)\) are “spent” on overly accurate fitting to the training sample \(X^\ell\);

  • overfitting arises whenever a model \(a\) is chosen from a set \(A\) on the basis of incomplete information (a finite sample \(X^\ell\)).

How to detect overfitting?

  • empirically, by dividing the sample into \(\text{train}\) and \(\text{test}\) parts, where the correct answers for \(\text{test}\) must be known.

Overfitting cannot be eliminated entirely. How can it be minimized?

  • minimize it using hold-out, LOO, or CV (but be careful!);

  • impose restrictions on \(θ\) (regularization).

Cross-validation (CV)#

An external criterion evaluates quality “outside the training set”, for example, on a hold-out control sample \(X^k\): \begin{equation} Q_{\mu}\big(X^\ell, X^k\big) = Q\big(\mu\big(X^\ell\big), X^k\big). \end{equation}

Averaging hold-out estimates over a given set \(N\) of partitions \(X^L = X_n^{\ell} \cup X_n^{k}, \quad n \in N\):

\begin{equation} \text{CV}\big(\mu, X^L\big) = \frac{1}{\vert N\vert} \sum_{n \in N} Q_{\mu}\big(X_n^{\ell}, X_n^{k}\big). \end{equation}

Special cases are different ways of setting \(N\).

  • A random set of partitions.

  • Complete cross-validation (CCV): \(N\) is the set of all \(C_{\ell+k}^{k}\) partitions.

Disadvantage: CCV estimation is computationally too complicated. Either small values of \(k\) or combinatorial estimates of CCV are used.

  • Sliding control (Leave-One-Out CV): \(~k=1\), \begin{equation} \text{LOO}\big(\mu, X^L\big) = \frac{1}{L} \sum_{i=1}^{L} Q_{\mu}\big(X^L \setminus \{x_i\}, ~\{x_i\}\big). \end{equation}

Disadvantage: \(\text{LOO}\): resource intensive, high variance.

  • Cross-validation on \(q\) blocks (\(q\)-fold CV): randomly split \(X^L = X_1^{\ell_1} \cup \ldots \cup X_q^{\ell_q}\) into \(q\) blocks of (almost) equal length,

\begin{equation} \text{CV}_q\big(\mu, X^L\big) = \frac{1}{q} \sum_{n=1}^{q} Q_{\mu}\big(X^L \setminus X_n^{\ell_n}, ~X_n^{\ell_n}\big). \end{equation}

The disadvantage of \(q\)-fold CV:

  • the score depends significantly on the division into blocks;

  • Each object participates in the control only once.
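These schemes map directly onto scikit-learn splitters; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, LeaveOneOut, cross_val_score,
                                     train_test_split)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = rng.integers(0, 2, size=60)
model = LogisticRegression()

# Hold-out: a single train/test partition
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
print(model.fit(X_tr, y_tr).score(X_te, y_te))

# Leave-one-out (k = 1): L model fits -- resource-intensive, high variance
print(cross_val_score(model, X, y, cv=LeaveOneOut()).mean())

# q-fold CV (q = 5): each object participates in the control exactly once
print(cross_val_score(model, X, y,
                      cv=KFold(n_splits=5, shuffle=True, random_state=0)).mean())
```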

Variability: multi-site, multi-device, multimodal fusion issues#

Preprocessing examples:#

  • EEG: filtering, artifact removal.

  • fMRI: motion correction, normalization.

  • fNIRS: light interference correction.
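For the EEG case, a minimal MNE-Python sketch of the filtering and artifact-removal steps (the file name and the excluded component index are hypothetical and would come from visual inspection):

```python
import mne

# Load a raw EEG recording (hypothetical file name)
raw = mne.io.read_raw_fif("subject01_raw.fif", preload=True)

# Band-pass filter: remove slow drifts (< 1 Hz) and high-frequency noise (> 40 Hz)
raw.filter(l_freq=1.0, h_freq=40.0)

# ICA-based artifact removal: fit, mark artifact components, project them out
ica = mne.preprocessing.ICA(n_components=20, random_state=0)
ica.fit(raw)
ica.exclude = [0]   # hypothetical: index of a blink component found by inspection
ica.apply(raw)
```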

Hidden Traps (Often Forgotten, Even by Professionals)#

  1. Data leakage

    • Test data contaminates training (normalizing before split, ICA across full dataset).

    • Solution: Always split before preprocessing/feature selection (see the pipeline sketch after this list).

  2. Multiple comparisons & statistical hypotheses

    • Thousands of features → false positives.

    • Solution: Control via FDR, Bonferroni, permutation tests (see the correction sketch after this list).

  3. Correlation ≠ causation

    • Brain–behavior correlations may reflect confounds (motion, demographics, site).

    • Solution: Use covariates, stratified CV, domain adaptation.

  4. Circular analysis (“double dipping”)

    • Selecting voxels/features on same data used for testing.

    • Solution: Nested CV, independent validation sets.

  5. Overfitting

    • Few subjects vs. millions of features.

    • Solution: Report CV with subject-level separation, not trial-level (see the GroupKFold sketch after this list).

  6. Reporting bias

    • Only reporting accuracy without variance or baseline.

    • Solution: Show chance levels, CIs, label-shuffling controls (see the permutation-test sketch after this list).

  7. Misunderstanding the nature of data

    • Not all features or modalities are equivalent; missing modalities (e.g., absent fMRI scans, dropped EEG channels, incomplete behavioral data) require domain-aware strategies.

    • Solution: Before imputation or harmonization, understand why data are missing (technical failure, subject dropout, site heterogeneity).

    • Solution: Apply biologically and technically justified handling (e.g., modality-specific imputation, harmonization methods such as ComBat, or analysis restricted to common modalities).
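For trap 1, a sketch of leakage-safe evaluation: wrapping the scaler in a scikit-learn Pipeline ensures it is refit on the training part of every CV fold instead of seeing the test data (synthetic data):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 300))
y = rng.integers(0, 2, size=80)

# LEAKY: scaling the full dataset first lets test-fold statistics into training
# X_scaled = StandardScaler().fit_transform(X)

# SAFE: the pipeline refits the scaler inside each training fold
safe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
print(cross_val_score(safe, X, y, cv=5).mean())
```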
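For trap 2, a sketch of mass-univariate testing followed by FDR correction, using scipy and statsmodels on synthetic data with no true group difference:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
group_a = rng.normal(size=(20, 1000))  # 20 subjects x 1000 features, no true effect
group_b = rng.normal(size=(20, 1000))

# One t-test per feature
_, pvals = stats.ttest_ind(group_a, group_b, axis=0)
print((pvals < 0.05).sum())   # ~50 "significant" features by chance alone

# Benjamini-Hochberg FDR correction
reject, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(reject.sum())           # ~0 after correction
```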
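For traps 5 and 6, a sketch of subject-level CV and a label-shuffling baseline (hypothetical trial-level data, 10 trials per subject):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GroupKFold, cross_val_score,
                                     permutation_test_score)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))           # 200 trials x 50 features
y = rng.integers(0, 2, size=200)
subjects = np.repeat(np.arange(20), 10)  # subject ID for each trial

model = LogisticRegression()

# Subject-level separation: all trials of one subject stay in the same fold
print(cross_val_score(model, X, y, cv=GroupKFold(n_splits=5),
                      groups=subjects).mean())

# Label-shuffling control: permuted-label scores estimate the chance level
score, perm_scores, pvalue = permutation_test_score(
    model, X, y, cv=5, n_permutations=100, random_state=0)
print(score, perm_scores.mean(), pvalue)
```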