
Getting Started with Nearest Neighbors Classification

This tutorial introduces classification using a Nearest Neighbors method (PEAXACT 5: Database Lookup). It is aimed at PEAXACT users and anyone interested in PEAXACT.

In this tutorial, you will learn how to:

  1. Use categorical features in a Cluster Analysis
  2. Apply pretreatments to improve sample clustering
  3. Perform classification using a Nearest Neighbors method
  4. Identify classes of unknown samples

If you have PEAXACT installed on your computer, you may try this tutorial right away. If you don't have PEAXACT yet, get a free trial now.

Preparations

You can find data for this tutorial in %ProgramFiles%\S-PACT\PEAXACT 5\Data\Raman - Pharma. The directory will be referred to as DATA in the following.

  • Start PEAXACT.
  • Choose File > New Session > Raman from the menu, which opens a new modeling session with default settings for Raman data.

Cluster Analysis

Sample clustering and classification deal with categorical features (also known as grouping variables). Categorical features contain text values – the categories, groups, species, levels, or classes. A cluster analysis aims to divide samples into groups without knowing the actual classes in advance. If we do know the actual classes, we can perform sanity checks on the clusters found and train a Classification Model. But let's take it one step at a time:

  • Choose Data > Load Table... from the menu, browse to DATA\References and select DataTableClassification.xlsx to load 90 Raman spectra of pharmaceutical ingredients with associated categorical features: substance name (10 classes), kind of packing (3 classes), and instrument type (3 classes).
  • Select all samples in the Samples Panel.
  • Choose Data > Data Inspector from the menu to start the Data Inspector. Switch from the Data Table Editor to the Data Plotter.
  • From the top-right drop-down list, select Clusters to display a dendrogram.

A dendrogram is a tree that illustrates the arrangement of clusters found by a hierarchical cluster analysis. Each leaf corresponds to one sample. Leaves are connected by branches, forming clusters. Clusters are connected to other clusters, forming even bigger clusters. The height of a branch represents the distance between the two objects being connected. If we assume that our spectra can be distinguished by substance, we would like to see a tree that splits into 10 big clusters separated by large distances. The tree doesn't look like that yet, though.
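
To make the idea of branch heights concrete, here is a minimal sketch of a hierarchical (agglomerative) cluster analysis in plain Python. It is not PEAXACT's implementation: the Euclidean distance, the single-linkage rule, the function name `single_linkage`, and the toy data are all assumptions chosen for illustration.

```python
from math import dist

def single_linkage(points):
    """Agglomerative clustering; returns merges as (members_a, members_b, height)."""
    clusters = [(i,) for i in range(len(points))]  # every sample starts as its own leaf
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage (assumed): distance between the closest pair of members
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))  # d is the branch height
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Hypothetical toy "spectra" with two intensity values each
samples = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
merges = single_linkage(samples)
for left, right, height in merges:
    print(left, "+", right, "at height", height)
```

In this toy data, the two tight pairs merge at a small height and are joined to each other only at a much larger height. That is the pattern we are hoping for between the 10 substance clusters.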

Improvements Through Data Pretreatments

  • Select Substance from the "C" drop-down list to colorize the tree by substance.
  • Click the Colorbar icon in the toolbar to display the color legend. (Resize the window if the tree is too small now.)
  • In the Data Pretreatment Panel (bottom-right), you can apply options to manipulate the spectra. See if you can find a combination of pretreatments that improves the clustering. Based on the coloring of the tree, you can judge whether clusters correspond nicely to classes and, given the branch length, how distinctly the clusters are separated.
  • For the rest of this tutorial, let's use the following pretreatments:
    • Resampling: Equidistant Points
    • Number of Points: 1000
    • Global Range: 260 1700 cm⁻¹
    • Smoothing/Derivative: 1st order derivative
    • Filter Length: 19
    • Standardization: SNV normalization
  • Clusters are much more pronounced now. Still, there are two clusters that each contain samples of two classes. But that's OK because in both cases (Titanium Dioxide A/B and Lactose Monohydrate A/B), the substances are identical and differ only by manufacturer (A and B).
  • Click the Export button to export pretreatments to a new model.
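
If you are curious what these pretreatments do to a spectrum, the following plain-Python sketch mimics three of them: resampling to equidistant points, a first derivative, and SNV normalization. It is only an illustration under assumptions. The helper names are made up, the derivative uses simple central differences instead of a smoothing filter with a filter length, and the order of the steps is assumed rather than taken from PEAXACT.

```python
from statistics import mean, stdev

def resample(x, y, n):
    """Linearly interpolate (x, y) onto n equidistant points spanning x[0]..x[-1]."""
    step = (x[-1] - x[0]) / (n - 1)
    new_x = [x[0] + k * step for k in range(n)]
    new_y, j = [], 0
    for xv in new_x:
        # Advance to the original interval [x[j], x[j+1]] that contains xv
        while j < len(x) - 2 and x[j + 1] < xv:
            j += 1
        t = (xv - x[j]) / (x[j + 1] - x[j])
        new_y.append(y[j] + t * (y[j + 1] - y[j]))
    return new_x, new_y

def first_derivative(x, y):
    """Central differences: a simple stand-in for a smoothing derivative filter."""
    return [(y[k + 1] - y[k - 1]) / (x[k + 1] - x[k - 1])
            for k in range(1, len(y) - 1)]

def snv(y):
    """Standard normal variate: subtract the mean, divide by the standard deviation."""
    m, s = mean(y), stdev(y)
    return [(v - m) / s for v in y]

# Hypothetical spectrum on an uneven axis
x = [0.0, 1.0, 3.0, 4.0]
y = [0.0, 1.0, 9.0, 16.0]
grid_x, grid_y = resample(x, y, 5)
treated = snv(first_derivative(grid_x, grid_y))
print(treated)
```

The derivative removes constant baseline offsets, and SNV puts every spectrum on the same scale. That is one plausible reason why this combination makes samples of the same substance cluster more tightly.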

Nearest Neighbors Classification

  • Back in the main window, select the new model in the Models Panel. Then choose Edit Model > Classification Model > New... from the menu to display the Classification Setup Dialog.
  • The classification method should be set to Nearest Neighbors (formerly Database Lookup) by default. Select {Substance} as the categorical feature to train the model for. Click OK to start the training.
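
Conceptually, a Nearest Neighbors classifier just stores the training spectra and assigns a new spectrum the class of the most similar stored one. The sketch below shows that idea with a Euclidean distance and a single neighbor. Both choices, the function name `nearest_neighbor`, and the toy intensity values are assumptions for illustration, not a description of what PEAXACT does internally.

```python
from math import dist

def nearest_neighbor(train_spectra, train_labels, spectrum):
    """Return the label of the training spectrum closest to `spectrum`."""
    distances = [dist(ref, spectrum) for ref in train_spectra]
    best = min(range(len(distances)), key=distances.__getitem__)
    return train_labels[best]

# Hypothetical, already pretreated spectra with three points each
train_spectra = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 1.0, 0.2)]
train_labels = ["Titanium Dioxide A", "Titanium Dioxide A", "Lactose Monohydrate A"]

predicted = nearest_neighbor(train_spectra, train_labels, (0.1, 0.9, 0.1))
print(predicted)
```

"Training" here amounts to keeping the reference spectra around, which is why good pretreatments matter so much: they define what "close" means.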

Classification results are presented in a Report Window. The first plot you see is the Confusion Matrix, which shows the per-class performance of the Classification Model. The overall misclassification error for training and test samples is displayed at the bottom of the window. Also take a look at the other reports, e.g., Identified Class vs. ..., which let you inspect the performance in more detail. In the end, though, Nearest Neighbors is a rather simple classification method, and all that remains is to accept the performance as is.
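
For readers who want to see how a confusion matrix and a misclassification error are derived from actual and identified classes, here is a small sketch. The class names "A", "B", "C" and the helper names are hypothetical. PEAXACT computes and plots these for you.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Rows are actual classes, columns are identified classes."""
    counts = Counter(zip(actual, predicted))
    classes = sorted(set(actual) | set(predicted))
    matrix = [[counts[(a, p)] for p in classes] for a in classes]
    return classes, matrix

def misclassification_error(actual, predicted):
    """Fraction of samples whose identified class differs from the actual class."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Hypothetical results for five samples
actual = ["A", "A", "B", "B", "C"]
predicted = ["A", "B", "B", "B", "C"]

classes, matrix = confusion_matrix(actual, predicted)
error = misclassification_error(actual, predicted)
for name, row in zip(classes, matrix):
    print(name, row)
print("misclassification error:", error)
```

Off-diagonal entries show which classes get mixed up, which is more informative than the single overall error value.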

  • Click OK to accept the Classification Model and close the Report Window.
  • Choose File > Save from the menu to save the model.

Identification Analysis

Now that you've trained the model, it can be used to identify classes of unknown samples.

  • Choose Data > Load Samples... from the menu, browse to DATA\Analysis and load all files. For these files, the classes are unknown.
  • Select the newly added samples in the Samples Panel and choose Analysis > Identification from the menu. Results are displayed in a Report Window. For instance, the Report Table shows identified classes, corresponding Similarities to the respective training spectra, and the Class Probability.
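
To give a feeling for what such a report row could contain, the sketch below ranks training spectra by similarity to an unknown spectrum and reports the best-matching class, its similarity, and a vote share among the top matches. How PEAXACT actually computes Similarity and Class Probability is not described in this tutorial. The cosine similarity, the vote share over `k` neighbors, and all names here are assumed stand-ins.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two spectra (1.0 means identical shape)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def identify(train_spectra, train_labels, spectrum, k=3):
    """Return (class, similarity of best match, vote share among the top k)."""
    ranked = sorted(
        ((cosine_similarity(ref, spectrum), label)
         for ref, label in zip(train_spectra, train_labels)),
        reverse=True,
    )
    top = ranked[:k]
    best_similarity, best_label = top[0]
    votes = Counter(label for _, label in top)
    # Vote share is only a rough proxy for a class probability (assumption)
    return best_label, best_similarity, votes[best_label] / len(top)

# Hypothetical training spectra with two points each
train_spectra = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
train_labels = ["Titanium Dioxide A", "Titanium Dioxide A",
                "Lactose Monohydrate A", "Lactose Monohydrate A"]

label, similarity, share = identify(train_spectra, train_labels, (0.8, 0.2))
print(label, round(similarity, 3), round(share, 2))
```

A high similarity together with a high vote share suggests a confident identification. A low similarity to every training spectrum hints that the unknown sample may not belong to any trained class.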

This concludes the tutorial on classification. But we have more for you on other topics. See the PEAXACT Quick Start page for an overview!
