Pinot Noir Classification
To train and test classification of Pinot Noir wines, we use the script train_test_pinot_noir.py. The goal is to classify wine samples based on their GC-MS chemical fingerprint, using geographic labels at different levels of granularity (e.g., winery, region, country, north-south of Burgundy, or continent).
The script implements a complete machine learning pipeline including data loading, preprocessing, region-based label extraction, feature computation, and repeated classifier evaluation.
Configuration Parameters
The script reads configuration parameters from a file (config.yaml) located at the root of the repository. Below is a description of the key parameters:
datasets: Dictionary mapping dataset names to local paths. Each path must contain .D folders for each chromatogram.
selected_datasets: The list of datasets to use for the analysis. Must be compatible in terms of m/z channels.
feature_type: Defines how chromatograms are converted into features for classification:
  tic: Use the Total Ion Chromatogram only.
  tis: Use individual Total Ion Spectrum channels.
  tic_tis: Concatenate TIC and TIS.
  concatenated: Flatten raw chromatograms across all channels.
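As a rough illustration of how the four feature types differ, the sketch below builds each one from a single chromatogram stored as a 2-D NumPy array. The array layout (retention times by m/z channels), the function name compute_features, the reading of TIS as the per-channel sum over retention time, and the channel-after-channel flattening order are all assumptions made for illustration; the script's own implementation may differ.

```python
import numpy as np

def compute_features(chromatogram: np.ndarray, feature_type: str) -> np.ndarray:
    """Return a 1-D feature vector for one chromatogram.

    Assumes `chromatogram` has shape (n_retention_times, n_mz_channels).
    """
    tic = chromatogram.sum(axis=1)  # total intensity at each retention time
    tis = chromatogram.sum(axis=0)  # total intensity in each m/z channel
    if feature_type == "tic":
        return tic
    if feature_type == "tis":
        return tis
    if feature_type == "tic_tis":
        return np.concatenate([tic, tis])
    if feature_type == "concatenated":
        # Assumed order: one full channel trace after another.
        return chromatogram.T.ravel()
    raise ValueError(f"unknown feature_type: {feature_type!r}")

# Toy chromatogram: 4 retention times x 3 m/z channels.
toy = np.arange(12, dtype=float).reshape(4, 3)
for name in ("tic", "tis", "tic_tis", "concatenated"):
    print(name, compute_features(toy, name).shape)
```

The feature length grows from the number of time points (tic) or channels (tis) to their product (concatenated), which is one reason n_decimation matters for the larger representations.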
classifier: Classification model to apply. Available options:
  DTC: Decision Tree Classifier
  GNB: Gaussian Naive Bayes
  KNN: K-Nearest Neighbors
  LDA: Linear Discriminant Analysis
  LR: Logistic Regression
  PAC: Passive-Aggressive Classifier
  PER: Perceptron
  RFC: Random Forest Classifier
  RGC: Ridge Classifier
  SGD: Stochastic Gradient Descent
  SVM: Support Vector Machine
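The option names above resemble common scikit-learn estimators. The lookup below is a sketch under that assumption: the documentation does not state which library or class backs each abbreviation, and CLASSIFIER_PATHS and resolve_classifier are hypothetical names introduced here, not part of the script.

```python
import importlib

# Assumed correspondence between config abbreviations and scikit-learn classes.
CLASSIFIER_PATHS = {
    "DTC": "sklearn.tree.DecisionTreeClassifier",
    "GNB": "sklearn.naive_bayes.GaussianNB",
    "KNN": "sklearn.neighbors.KNeighborsClassifier",
    "LDA": "sklearn.discriminant_analysis.LinearDiscriminantAnalysis",
    "LR": "sklearn.linear_model.LogisticRegression",
    "PAC": "sklearn.linear_model.PassiveAggressiveClassifier",
    "PER": "sklearn.linear_model.Perceptron",
    "RFC": "sklearn.ensemble.RandomForestClassifier",
    "RGC": "sklearn.linear_model.RidgeClassifier",
    "SGD": "sklearn.linear_model.SGDClassifier",
    "SVM": "sklearn.svm.SVC",
}

def resolve_classifier(name: str):
    """Import and return the estimator class for a config abbreviation."""
    try:
        dotted = CLASSIFIER_PATHS[name]
    except KeyError:
        raise ValueError(f"unknown classifier: {name!r}") from None
    module_name, class_name = dotted.rsplit(".", 1)
    module = importlib.import_module(module_name)  # requires scikit-learn
    return getattr(module, class_name)
```

With scikit-learn installed, resolve_classifier("RGC")() would construct a ridge classifier with default settings; the import is deferred so that only the selected model's module is loaded.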
num_splits: Number of repeated train/test splits to run.
normalize: Whether to apply standard scaling before classification. Normalization is fit on training data only.
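Fitting the scaler on training data only means the test samples never influence the mean and standard deviation used for scaling. The sketch below shows the idea with plain NumPy; the function name standardize is hypothetical, and the script may instead rely on a library scaler such as scikit-learn's StandardScaler.

```python
import numpy as np

def standardize(train: np.ndarray, test: np.ndarray):
    """Scale both sets with the mean and std of the training set only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (train - mean) / std, (test - mean) / std

# Toy data: two training samples, one test sample, two features.
train = np.array([[1.0, 10.0], [3.0, 10.0]])
test = np.array([[2.0, 12.0]])
train_scaled, test_scaled = standardize(train, test)
print(train_scaled)
print(test_scaled)
```

Computing the statistics on the full dataset before splitting would leak information about the test samples into training.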
n_decimation: Downsampling factor along the retention time axis to reduce dimensionality.
sync_state: Whether to align chromatograms using retention time synchronization (useful for Pinot Noir samples with retention drift).
region: Defines the classification target. Available options:
  winery: Classify by individual wine producer
  origin: Group samples by geographic region (e.g., Beaune, Alsace)
  country: Group by country (e.g., France, Switzerland, USA)
  continent: Group by continent
  north_south_burgundy: Binary classification of northern vs southern Burgundy subregions
wine_kind: Internally inferred from dataset paths. Should not be set manually.
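To make the parameter list concrete, the sketch below checks a configuration dictionary against the options listed above. It is illustrative only: validate_config is a hypothetical helper, the example values and the dataset path are placeholders rather than recommended settings, and the case-insensitive match on 'pinot' is an assumption.

```python
# Allowed values, copied from the parameter descriptions above.
ALLOWED = {
    "feature_type": {"tic", "tis", "tic_tis", "concatenated"},
    "classifier": {"DTC", "GNB", "KNN", "LDA", "LR", "PAC",
                   "PER", "RFC", "RGC", "SGD", "SVM"},
    "region": {"winery", "origin", "country", "continent",
               "north_south_burgundy"},
}

def validate_config(config: dict) -> list:
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    for key, allowed in ALLOWED.items():
        if config.get(key) not in allowed:
            problems.append(f"{key}={config.get(key)!r} is not one of {sorted(allowed)}")
    datasets = config.get("datasets", {})
    for name in config.get("selected_datasets", []):
        if name not in datasets:
            problems.append(f"selected dataset {name!r} has no path in 'datasets'")
        elif "pinot" not in datasets[name].lower():
            problems.append(f"path for {name!r} does not contain 'pinot'")
    return problems

# Placeholder configuration mirroring the keys of config.yaml.
example = {
    "datasets": {"pinot_example": "/path/to/pinot_example/"},
    "selected_datasets": ["pinot_example"],
    "feature_type": "tic_tis",
    "classifier": "RGC",
    "num_splits": 20,
    "normalize": True,
    "n_decimation": 10,
    "sync_state": False,
    "region": "winery",
}
print(validate_config(example))
```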
Script Overview
This script performs classification of Pinot Noir wine samples using GC-MS data and a configurable classification pipeline. It allows for flexible region-based classification using a strategy abstraction.
The main workflow is:
Configuration Loading:
Loads classifier, region, feature type, and dataset settings from config.yaml.
Confirms that all dataset paths are compatible (must contain ‘pinot’).
Data Loading and Preprocessing:
Chromatograms are loaded and decimated.
Channels with zero variance are removed.
If sync_state=True, samples are aligned by retention time.
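The decimation and zero-variance steps can be sketched as follows. This is a minimal version assuming the data is a NumPy array of shape (samples, retention times, m/z channels) and that decimation simply keeps every n-th time point; decimate_and_filter is a hypothetical name, and retention time alignment is not shown.

```python
import numpy as np

def decimate_and_filter(data: np.ndarray, n_decimation: int) -> np.ndarray:
    """Downsample along retention time, then drop constant m/z channels.

    Assumes `data` has shape (n_samples, n_retention_times, n_mz_channels).
    """
    decimated = data[:, ::n_decimation, :]  # keep every n-th time point
    # Variance of each channel over all samples and time points.
    channel_variance = decimated.var(axis=(0, 1))
    return decimated[:, :, channel_variance > 0]

# Toy data: 5 samples, 100 time points, 4 channels, one of them constant.
data = np.random.default_rng(0).random((5, 100, 4))
data[:, :, 2] = 0.0
reduced = decimate_and_filter(data, n_decimation=10)
print(reduced.shape)  # (5, 10, 3)
```

Plain slicing applies no anti-aliasing filter, so sharp peaks narrower than the decimation factor can be missed; whether the script filters before downsampling is not stated here.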
Label Processing:
Region-based labels are extracted using process_labels_by_wine_kind() and the WineKindStrategy abstraction.
Granularity is determined by the region parameter (e.g., “winery” or “country”).
Classification:
Initializes a Classifier instance with the chosen feature representation and classifier model.
Runs repeated evaluation via train_and_evaluate_all_channels() using the selected splitting strategy.
Cross-Validation and Replicate Handling:
If LOOPC=True, one wine is randomly selected per class and held out, together with all of its replicates, as the test set. Each test fold therefore contains exactly one unique wine per class, and no wine has replicates in both train and test. The rest of the data is used for training.
If LOOPC=False, stratified shuffling is used while still preventing replicate leakage.
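The LOOPC=True behaviour described above can be sketched in a few lines. The helper below is an illustration rather than the script's code: loopc_split, the wine identifiers, and the class labels are hypothetical, and each wine is assumed to belong to exactly one class.

```python
import random
from collections import defaultdict

def loopc_split(wine_ids, class_labels, rng):
    """Hold out one randomly chosen wine per class, with all of its replicates.

    `wine_ids[i]` names the wine that sample i is a replicate of and
    `class_labels[i]` is its class. Returns (train_indices, test_indices).
    """
    wines_by_class = defaultdict(set)
    for wine, label in zip(wine_ids, class_labels):
        wines_by_class[label].add(wine)
    # Sorting makes the random choice reproducible for a given seed.
    held_out = {rng.choice(sorted(wines)) for wines in wines_by_class.values()}
    test = [i for i, wine in enumerate(wine_ids) if wine in held_out]
    train = [i for i, wine in enumerate(wine_ids) if wine not in held_out]
    return train, test

# Toy data: two classes, two wines per class, two replicates per wine.
wine_ids = ["A1", "A1", "A2", "A2", "B1", "B1", "B2", "B2"]
class_labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
train, test = loopc_split(wine_ids, class_labels, random.Random(0))
print("train:", train, "test:", test)
```

Because the split is made on wine identifiers rather than on individual samples, replicates of the same wine always end up on the same side.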
Evaluation:
Prints average and standard deviation of balanced accuracy across splits.
Displays label ordering and sample distribution.
Set show_confusion_matrix=True to visualize the averaged confusion matrix with matplotlib.
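Balanced accuracy is the mean of the per-class recall, which keeps large classes from dominating the score. The sketch below computes it for two toy splits and summarises them with a mean and standard deviation. The labels are made up for illustration, balanced_accuracy is a hypothetical helper (scikit-learn's balanced_accuracy_score computes the same quantity), and the use of the population standard deviation is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return mean(correct[c] / total[c] for c in total)

# Two toy splits as (true labels, predicted labels); values are made up.
splits = [
    (["FR", "FR", "FR", "CH"], ["FR", "FR", "FR", "FR"]),
    (["FR", "FR", "FR", "CH"], ["FR", "FR", "CH", "CH"]),
]
scores = [balanced_accuracy(t, p) for t, p in splits]
print(f"balanced accuracy: {mean(scores):.3f} +/- {pstdev(scores):.3f}")
```

In the first split, plain accuracy would be 0.75 even though the minority class is never predicted correctly, while balanced accuracy is 0.5.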
Requirements
Properly structured Pinot Noir GC-MS dataset folders
All dependencies installed (see README.md)
Valid paths and regions configured in config.yaml
Usage
From the root of the repository, run:
python scripts/pinot_noir/train_test_pinot_noir.py