Bordeaux Classification

To train and test classification of Bordeaux wines, we use the script train_test_bordeaux.py. The goal is to classify Bordeaux wine samples based on their GC-MS chemical fingerprint, using either sample-level identifiers (e.g., A2022) or vintage year labels (e.g., 2022) depending on the configuration.

The script implements a complete machine learning pipeline including data loading, label parsing, feature extraction, classification, and repeated evaluation using replicate-safe splitting.

Configuration Parameters

The script reads configuration parameters from a file (config.yaml) located at the root of the repository. Below is a description of the key parameters:

  • datasets: A dictionary mapping dataset names to paths on your local machine. Each path should contain .D folders for raw GC-MS samples.

  • selected_datasets: The list of datasets to include. All selected datasets must be compatible in terms of m/z channels.

  • feature_type: Determines how chromatographic data are aggregated for classification.

    • tic: Use the Total Ion Chromatogram only.

    • tis: Use individual Total Ion Spectrum channels.

    • tic_tis: Concatenate TIC and TIS into a joint feature vector.

  • classifier: The classification algorithm to use. Options include:

    • DTC: Decision Tree Classifier

    • GNB: Gaussian Naive Bayes

    • KNN: K-Nearest Neighbors

    • LDA: Linear Discriminant Analysis

    • LR: Logistic Regression

    • PAC: Passive-Aggressive Classifier

    • PER: Perceptron

    • RFC: Random Forest Classifier

    • RGC: Ridge Classifier

    • SGD: Stochastic Gradient Descent

    • SVM: Support Vector Machine

  • num_splits: Number of repetitions for train/test evaluation. Higher values give more stable estimates of the mean and standard deviation of the score, at the cost of longer runtime.

  • normalize: Whether to apply standard scaling to features. The scaler is fitted on the training set only and then applied to the test set.

  • n_decimation: Downsampling factor for chromatograms along the retention time axis.

  • sync_state: Enables retention time alignment between samples (typically not needed for Bordeaux).

  • region: Not used in Bordeaux classification, but required for other pipelines such as Pinot Noir.

  • class_by_year: If True, samples are classified by vintage year (e.g., 2020, 2021). If False, samples are classified by composite label (e.g., A2022).

  • wine_kind: Inferred internally from the dataset path, which should contain bordeaux. Do not set it manually.
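
To make the feature_type and n_decimation options more concrete, here is a minimal sketch in plain Python. It assumes a sample is a matrix with retention times as rows and m/z channels as columns, that TIC sums over channels at each retention time, that TIS sums over retention time for each channel, and that decimation keeps every n-th retention-time point. The function names are hypothetical and the script's own implementation may differ.

```python
# Hypothetical helper names; the repository's own implementation may differ.

def decimate(chromatogram, n_decimation):
    """Keep every n-th retention-time row (plain downsampling, no filtering)."""
    return chromatogram[::n_decimation]


def tic(chromatogram):
    """Total Ion Chromatogram: sum over m/z channels at each retention time."""
    return [sum(row) for row in chromatogram]


def tis(chromatogram):
    """Total Ion Spectrum: sum over retention time for each m/z channel."""
    return [sum(column) for column in zip(*chromatogram)]


def extract_features(chromatogram, feature_type):
    """Build one feature vector according to the feature_type setting."""
    if feature_type == "tic":
        return tic(chromatogram)
    if feature_type == "tis":
        return tis(chromatogram)
    if feature_type == "tic_tis":
        return tic(chromatogram) + tis(chromatogram)  # list concatenation
    raise ValueError(f"Unknown feature_type: {feature_type!r}")


# Toy sample: 4 retention times (rows) x 3 m/z channels (columns).
sample = [
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 1.0],
    [4.0, 1.0, 0.0],
    [2.0, 2.0, 2.0],
]

reduced = decimate(sample, 2)  # keeps rows 0 and 2
features = extract_features(reduced, "tic_tis")
print(features)  # [3.0, 5.0, 5.0, 1.0, 2.0]
```

With tic_tis the feature vector has one entry per retained retention time followed by one entry per m/z channel, which is why all selected datasets need compatible m/z channels.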

Script Overview

This script performs classification of Bordeaux wine samples using GC-MS data and a configurable machine learning pipeline.

All parameters are loaded from a central config.yaml file, enabling reproducibility and flexibility.
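
As a rough illustration of the kind of checks such a configuration lends itself to, the sketch below validates a dictionary shaped like the parameters described above. It is not the script's own code: validate_config, the example values, and the dataset name and path are all hypothetical, and the case-insensitive substring test for bordeaux is an assumption.

```python
ALLOWED_FEATURE_TYPES = {"tic", "tis", "tic_tis"}
ALLOWED_CLASSIFIERS = {
    "DTC", "GNB", "KNN", "LDA", "LR", "PAC", "PER", "RFC", "RGC", "SGD", "SVM",
}


def validate_config(config):
    """Raise ValueError on the first inconsistent setting; return config otherwise."""
    for name in config["selected_datasets"]:
        if name not in config["datasets"]:
            raise ValueError(f"Selected dataset {name!r} is not listed in datasets")
        # Assumption: the Bordeaux check is a case-insensitive substring test.
        if "bordeaux" not in config["datasets"][name].lower():
            raise ValueError(f"Dataset {name!r} does not look like a Bordeaux dataset")
    if config["feature_type"] not in ALLOWED_FEATURE_TYPES:
        raise ValueError(f"Unknown feature_type: {config['feature_type']!r}")
    if config["classifier"] not in ALLOWED_CLASSIFIERS:
        raise ValueError(f"Unknown classifier: {config['classifier']!r}")
    if not isinstance(config["num_splits"], int) or config["num_splits"] < 1:
        raise ValueError("num_splits must be a positive integer")
    return config


# Hypothetical values, shaped like the parameters described above.
example_config = {
    "datasets": {"bordeaux_example": "/path/to/bordeaux_example"},
    "selected_datasets": ["bordeaux_example"],
    "feature_type": "tic_tis",
    "classifier": "RGC",
    "num_splits": 50,
    "normalize": True,
    "n_decimation": 10,
    "sync_state": False,
    "class_by_year": True,
}

validate_config(example_config)
```

In the real workflow these values live in config.yaml; a dictionary is used here only to keep the example free of external dependencies.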

The main steps include:

  1. Configuration Loading:

    • Loads paths, classifier settings, and feature types from the config file.

    • Verifies that all selected datasets are Bordeaux-type (i.e., paths contain ‘bordeaux’).

  2. Data Loading and Preprocessing:

    • Loads and optionally decimates GC-MS chromatograms using GCMSDataProcessor.

    • Removes channels with zero variance.

    • Optionally synchronizes retention times across samples when sync_state=True.

  3. Label Processing:

    • Labels are parsed based on class_by_year:

      • If True, classification is done by year (e.g., 2021).

      • If False, composite labels like A2022 are used.

    • Label extraction and grouping are managed by the WineKindStrategy abstraction layer.

  4. Classification:

    • A Classifier object is initialized with the processed data and selected classifier.

    • The train_and_evaluate_all_channels() method runs repeated evaluations across all channels or selected feature types.

  5. Cross-Validation and Replicate Handling:

    • If LOOPC=True, one sample is randomly selected per class and, together with all of its replicates, forms the test set. Each test fold therefore contains exactly one unique wine per class, and no sample is split across train and test. The remaining data is used for training.

    • If LOOPC=False, stratified shuffling is used instead. Replicates of the same sample are still kept together by treating each sample as a group.

  6. Evaluation:

    • Prints mean and standard deviation of balanced accuracy.

    • Displays label counts and ordering used for confusion matrix construction.

    • Set show_confusion_matrix=True to visualize the averaged confusion matrix with matplotlib.
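
The LOOPC=True splitting rule in step 5 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the script's implementation: loopc_split is a hypothetical name, each measurement is assumed to carry a sample identifier shared by all of its replicates, and each sample identifier is assumed to belong to a single class.

```python
import random
from collections import defaultdict


def loopc_split(sample_ids, labels, rng):
    """Pick one wine per class; all replicates of the chosen wines form the test set.

    sample_ids[i] identifies the wine that measurement i is a replicate of,
    labels[i] is the class of measurement i.
    Returns (train_indices, test_indices).
    """
    wines_by_class = defaultdict(set)
    for sample_id, label in zip(sample_ids, labels):
        wines_by_class[label].add(sample_id)

    # Sort before choosing so that a given seed always yields the same split.
    test_wines = {rng.choice(sorted(wines)) for wines in wines_by_class.values()}

    train_idx, test_idx = [], []
    for i, sample_id in enumerate(sample_ids):
        (test_idx if sample_id in test_wines else train_idx).append(i)
    return train_idx, test_idx


# Toy data: four wines with two replicates each, classified by year.
sample_ids = ["A2021", "A2021", "B2021", "B2021", "C2022", "C2022", "D2022", "D2022"]
labels = ["2021", "2021", "2021", "2021", "2022", "2022", "2022", "2022"]

rng = random.Random(0)
for repeat in range(3):  # plays the role of num_splits
    train_idx, test_idx = loopc_split(sample_ids, labels, rng)
    test_wines = sorted({sample_ids[i] for i in test_idx})
    print(f"repeat {repeat}: test wines = {test_wines}")
```

Because whole wines are moved to the test set, replicates of the same wine never appear on both sides of the split. Note that a class represented by a single wine would have no training data left under this rule.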

Requirements

  • Properly structured GC-MS dataset folders

  • All required Python dependencies installed (see README.md)

  • Dataset paths correctly specified in config.yaml

Usage

From the root of the repository, run:

python scripts/bordeaux/train_test_bordeaux.py