Data mining is one of the most effective statistical methodologies for investigating a variety of problems in areas including pattern recognition, machine learning, bioinformatics, chemometrics, and statistics. In particular, statistically sophisticated procedures that emphasize the reliability of results and computational efficiency are required for the analysis of high-dimensional data. Optimization principles can play a significant role in the rationalization and validation of specialized data mining procedures. This paper presents a novel methodology based on Multi-Choice Wavelet Thresholding (MCWT) that consists of three steps: perception (dimension reduction), decision (feature ranking), and cognition (model selection). In these steps, three concepts, namely wavelet thresholding, support vector machines for classification, and information complexity, are integrated to evaluate learning models. Three published data sets are used to illustrate the proposed methodology, and performance comparisons with recent and widely applied methods are reported.

Data mining procedures are essentially based on statistical principles and machine learning theory, creatively integrated to facilitate the identification of significant informative patterns in a given database. Recurrent strategies used in data mining include preprocessing, data partitioning, machine learning (modeling), and validation. The ultimate goal of these procedures is the discovery of unknown and valuable information.

As indicated by

Classification analysis methods, based on several different types of algorithms, have been proposed to find successful models for complicated data in an extensive range of application domains. The objective of classification analysis is to identify groups of observations based on the input variables, minimizing the within-group variability and maximizing the between-group variability. Recently, not only classification but also other supervised and unsupervised learning areas have faced two challenging issues: (i) the curse of dimensionality; and (ii) nonlinearity. Several researchers have developed new classification analysis techniques to mitigate the curse of dimensionality; spectral regression discriminant analysis (

Variable selection is an important area of research in machine learning, pattern recognition, statistics, and related fields. The key idea of variable selection is to find input variables which have predictive information and to eliminate non-informative variables. The use of variable selection techniques is motivated by three reasons: (i) to improve discriminant power; (ii) to find fast and cost-effective variables; and (iii) to reach a better understanding of the application process (

This article proposes a novel methodology that integrates multi-choice wavelet thresholding (MCWT) with a variable selection method for classification to perform three steps known as perception, decision, and cognition. The proposed procedure will be referred to as a

The section “The Proposed PDCM” reviews relevant procedures integrated to design the proposed methodology. These procedures can be classified into the following areas:

Wavelet thresholding-based dimension reduction

Variable selection (feature ranking)

Cognition accuracy (model selection)

The performance of the methodology is tested using three published data sets, and the corresponding results are documented in section “Experimental Results”. Section “Discussion” concludes this article. Information on the three real benchmark datasets is presented in section “

The proposed Perception-Decision-Cognition Methodology (PDCM) for discriminant analysis is conceptually represented in

Perceive environmental information.

Decide on response (actions).

Cognize the accuracy of results to adjust the response.

The algorithm used by the PDCM consists of three steps, conceptually described below, assuming that all data have been partitioned into three sets: a training set, a cognition (validation) set, and a test set.

Let the sample data be

The procedure can be described as follows. Remove each remaining variable in turn and compute the resulting ICOMP_{PERF} value; the variable whose removal yields the smallest ICOMP_{PERF} is eliminated, and the order of elimination defines the variable ranking.

Compute the accuracy value for each cognition data set using the SVMC for all possible subsets of the ranked variables selected in Step 2. Specifically, first consider the variable with the highest rank (i.e., rank = 1) and calculate the cognition accuracy value. Next, the two variables with rank = 1 and rank = 2 are considered, and a new cognition accuracy value is calculated. This procedure is repeated until all ranked variables have been considered. Finally, the subset of variables yielding the highest accuracy value is chosen as the best model. The rationale for using the two preceding steps is to transform the nonlinear input data with wavelets and to find informative variables with ICOMP_{PERF}-RFE
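The subset search described in this step can be sketched as follows. Since a full SVMC is beyond a short example, a simple nearest-centroid classifier stands in for the SVMC accuracy computation; the function names and the stand-in classifier are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def nearest_centroid_accuracy(X_tr, y_tr, X_cog, y_cog):
    """Stand-in classifier: the paper uses SVMC; a nearest-centroid
    rule keeps this sketch dependency-free."""
    classes = np.unique(y_tr)
    centroids = np.array([X_tr[y_tr == c].mean(axis=0) for c in classes])
    d = ((X_cog[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = classes[np.argmin(d, axis=1)]
    return (pred == y_cog).mean()

def best_ranked_subset(X_tr, y_tr, X_cog, y_cog, ranking):
    """Grow the model one ranked variable at a time (rank 1 first) and
    keep the subset with the highest cognition accuracy."""
    best_acc, best_subset = -1.0, None
    for k in range(1, len(ranking) + 1):
        idx = ranking[:k]                      # top-k ranked variables
        acc = nearest_centroid_accuracy(X_tr[:, idx], y_tr,
                                        X_cog[:, idx], y_cog)
        if acc > best_acc:
            best_acc, best_subset = acc, idx
    return best_subset, best_acc
```

Because the accuracy must strictly improve to replace the incumbent, the search prefers the smallest subset among ties, which matches the parsimony goal of model selection.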

Dimension reduction is a preferred strategy in the area of machine learning. As anticipated, there are several approaches to perform dimensional reduction. The following methods are among the most popular: principal component analysis (

The dimension reduction strategy has important benefits that can be measured not only in terms of computational time savings, but also in accuracy improvement. In the novel PDCM, the wavelet-based dimension reduction is applied in Step 1. The wavelets approach was selected because of several attractive attributes, among which the following two are most relevant: (a) wavelets adapt effectively to spatial features of a function such as discontinuities and varying frequency behavior; (b) wavelets have efficient

Discrete Wavelet Transformation (DWT) is often used for dimension reduction (also known as shrinkage or thresholding). The representation of the data in terms of the scaling and wavelet functions, which form an orthogonal basis in the time domain, is as follows:

where

where

through the transformation

For multiple samples, let vector

where
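As an illustration of the transform underlying this step, a single level of the DWT with the orthonormal Haar scaling/wavelet pair can be sketched as follows; the paper does not fix a wavelet family, so Haar is chosen here only for brevity.

```python
import numpy as np

def haar_dwt(x):
    """One level of the orthonormal Haar DWT: scaling (approximation)
    and wavelet (detail) coefficients.  len(x) must be even."""
    x = np.asarray(x, dtype=float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # scaling coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # wavelet coefficients
    return a, d

def haar_idwt(a, d):
    """Inverse transform: perfect reconstruction from (a, d)."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x
```

Because the basis is orthonormal, the transform preserves the energy of the signal, which is what makes thresholding the coefficients a meaningful way to separate signal from noise.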

Small absolute values of wavelet coefficients are undesirable since they may be influenced more by noise than by information, whereas large absolute values are influenced more by information than by noise. This observation motivates the development of threshold methods. There are two threshold rules, usually referred to as hard thresholding and soft thresholding:

here
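A minimal sketch of the two rules, with `lam` denoting a generic threshold value produced by whichever selection rule is in use:

```python
import numpy as np

def hard_threshold(w, lam):
    """Hard rule: keep coefficients whose magnitude exceeds lam,
    zero the rest."""
    return np.where(np.abs(w) > lam, w, 0.0)

def soft_threshold(w, lam):
    """Soft rule: zero small coefficients and shrink the survivors
    toward zero by lam."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)
```

The soft rule trades a small bias on large coefficients for a continuous shrinkage map, which is why VisuShrink, discussed next, adopts it.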

VisuShrink is a soft thresholding technique that applies a universal threshold proposed by

where,
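A minimal sketch of the universal threshold, assuming the common Donoho-Johnstone form λ = σ√(2 log n), with σ estimated by the median absolute deviation of the finest-scale detail coefficients:

```python
import numpy as np

def universal_threshold(detail):
    """Universal threshold lambda = sigma * sqrt(2 log n), with sigma
    estimated robustly from the finest-scale detail coefficients."""
    detail = np.asarray(detail, dtype=float)
    sigma = np.median(np.abs(detail)) / 0.6745   # MAD noise estimate
    return sigma * np.sqrt(2.0 * np.log(detail.size))

def visushrink(coeffs, detail):
    """Soft-threshold wavelet coefficients at the universal threshold."""
    lam = universal_threshold(detail)
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - lam, 0.0)
```

The constant 0.6745 rescales the median absolute deviation so that σ is unbiased for Gaussian noise.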

VET was proposed by

where the subscript pair (i, j) indexes the j-th wavelet position of the i-th data curve,

The VET method minimizes the overall relative reconstruction error (

Since the VET procedure does not consider the scale information of wavelets, an improved procedure proposed by

where, ^{th}

Once the reduced sample space is determined in Step 1, the decision regarding which of the remaining variables should be selected for ranking is made on the basis of minimal information complexity values, following the Information Complexity Performance Testing with Recursive Feature Elimination (ICOMP_{PERF}-RFE) procedure.
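The elimination loop can be sketched as follows. Computing ICOMP_{PERF} itself is beyond a short example, so a per-variable Fisher score serves as a purely illustrative stand-in criterion; only the recursive-elimination skeleton mirrors the procedure described here.

```python
import numpy as np

def fisher_score(X, y, j):
    """Hypothetical stand-in criterion for ICOMP_{PERF}: a Fisher
    score for variable j, assuming binary labels 0/1.  Used only to
    make the skeleton runnable."""
    c0, c1 = X[y == 0, j], X[y == 1, j]
    num = (c0.mean() - c1.mean()) ** 2
    den = c0.var() + c1.var() + 1e-12
    return num / den

def rfe_ranking(X, y, criterion=fisher_score):
    """Recursive feature elimination: repeatedly drop the least
    informative remaining variable; variables surviving longer
    receive better (lower) ranks."""
    remaining = list(range(X.shape[1]))
    ranking = []
    while remaining:
        scores = [criterion(X, y, j) for j in remaining]
        worst = remaining[int(np.argmin(scores))]
        remaining.remove(worst)
        ranking.append(worst)          # eliminated first = worst rank
    return ranking[::-1]               # best-ranked variable first
```

Any criterion with the same signature, including an information-complexity score, can be plugged in without changing the elimination skeleton.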

Information complexity for discriminant analysis is evaluated using the modified maximal entropic complexity C_{1F}

where s is the rank of the covariance matrix and λ_j denotes its j-th eigenvalue.

ICOMP_{PERF}

where lack of fit is assessed by means of the first three terms and complexity by the fourth one. In the above expression,

and

where

and

Specific details on this procedure are provided by

Once the ranking in Step 2 is complete, the corresponding accuracies are determined using the cognition sets and the support vector machines for classification (SVMC) described below. Once the accuracies are calculated for the selected models, the most accurate one is chosen.

The SVMC finds an optimal separating hyperplane that maximizes the margin between the classes (

where

subject to

where

Function | Kernel | Parameter
---|---|---
Gaussian | K(x, y) = exp(−‖x − y‖² / (2σ²)) | σ
Cauchy | K(x, y) = 1 / (1 + ‖x − y‖² / σ²) | σ
Inverse Multi-Quadratic | K(x, y) = 1 / √(‖x − y‖² + c²) | c
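The Gram matrices for these kernels can be sketched as follows. The exact parameterizations in the paper's table are not fully recoverable here, so the common textbook forms are assumed:

```python
import numpy as np

def sq_dists(X, Y):
    """Pairwise squared Euclidean distances between rows of X and Y."""
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-sq_dists(X, Y) / (2.0 * sigma ** 2))

def cauchy_kernel(X, Y, sigma=1.0):
    """Cauchy kernel: 1 / (1 + ||x - y||^2 / sigma^2)."""
    return 1.0 / (1.0 + sq_dists(X, Y) / sigma ** 2)

def inv_multiquadratic_kernel(X, Y, c=1.0):
    """Inverse multi-quadratic kernel: 1 / sqrt(||x - y||^2 + c^2)."""
    return 1.0 / np.sqrt(sq_dists(X, Y) + c ** 2)
```

Such precomputed Gram matrices are exactly what a kernelized SVMC consumes in place of raw inner products.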

The point

and

where

In order to emphasize the effectiveness of the PDCM, it is applied to three different data sets (Heart, Fat, and Handwritten), covering experiments with both low-dimension high-sample-size and high-dimension low-sample-size settings. Each data set is divided into three sets: training (50% of the total), cognition (10%), and test (40%). For the wavelet transformation of the three data sets, the linear padding suggested by (

The data set includes 267 samples and 44 variables on cardiac single-photon emission computed tomography (SPECT) images with two categories, i.e., normal and abnormal. There are 55 normal and 212 abnormal samples (^{1}

The data is available at

ICOMP_{PERF}-RFE

Method | Selected Variables from ICOMP_{PERF}-RFE | Number of Variables | Cognition Accuracy | Test Accuracy
---|---|---|---|---
PDCM (MSVET) | 20, 28, 30 | 3 | 100% | 92%
PDCM (VET) | 20, 28, 50 | 3 | 100% | 92%
PDCM (VERTI) | 19, 34 | 2 | 100% | 92%
PDCM (VISU UNION) | 9, 20, 44, 53 | 4 | 100% | 92%
PDCM (VISU INTERSECT) | 8, 15 | 2 | 100% | 92%
SVMC-RFE | 6, 16, 17, 18, 26, 32, 35 | 7 | 92% | 76%
Two-Stage | 43 | 1 | 92% | 78%

Method | Selected Variables from ICOMP_{PERF}-RFE | Number of Variables | Cognition Accuracy | Test Accuracy
---|---|---|---|---
PDCM (MSVET) | 6, 10, 18, 29 | 4 | 92% | 76%
PDCM (VET) | 6, 10, 18, 29 | 4 | 92% | 76%
PDCM (VERTI) | 6, 10, 18, 29 | 4 | 92% | 76%
PDCM (VISU UNION) | 9, 10, 16, 29 | 4 | 92% | 76%
PDCM (VISU INTERSECT) | 10, 19, 28 | 3 | 92% | 78%
SVMC-RFE | 22 | 1 | 92% | 76%
Two-Stage | 43 | 1 | 92% | 78%

As observed in

These data were collected by a Tecator infratec food and feed analyzer to predict the fat content of a meat sample based on near infrared (NIR) spectroscopy. The data set was divided into two classes defined on the basis of fat content; one class (low-fat) corresponded to 20% or less, and another class (high-fat) to more than this level (^{2}

The data is available at

ICOMP_{PERF}-RFE

Method | Selected Variables from ICOMP_{PERF}-RFE | Number of Variables | Cognition Accuracy | Test Accuracy
---|---|---|---|---
PDCM (MSVET) | 2, 5 | 2 | 91% | 90%
PDCM (VET) | 5, 32 | 2 | 91% | 90%
PDCM (VERTI) | 6, 32 | 2 | 91% | 84%
PDCM (VISU UNION) | 1, 5 | 2 | 91% | 90%
PDCM (VISU INTERSECT) | 4, 28 | 2 | 91% | 91%
SVMC-RFE | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100 | 96 | 82% | 84%
Two-Stage | 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45 | 13 | 82% | 86%

Method | Selected Variables from ICOMP_{PERF}-RFE | Number of Variables | Cognition Accuracy | Test Accuracy
---|---|---|---|---
PDCM (MSVET) | 4, 5, 14, 17, 18, 29 | 6 | 91% | 92%
PDCM (VET) | 4, 5, 10, 16, 22, 25, 32 | 7 | 91% | 91%
PDCM (VERTI) | 1, 5, 8, 12, 14, 20, 22, 25, 27 | 9 | 91% | 86%
PDCM (VISU UNION) | 1, 5, 6, 13, 17, 19, 24, 25, 30 | 9 | 91% | 91%
PDCM (VISU INTERSECT) | 5, 9, 10, 16, 19, 28 | 6 | 91% | 88%
SVMC-RFE | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100 | 92 | 73% | 85%
Two-Stage | 40, 43, 45 | 3 | 73% | 80%

This data set has features of handwritten numerals from 0 to 9 extracted from a collection of Dutch utility maps. The entire set consists of 200 samples per numeral, digitized as binary images, and six different variable sets: 76 Fourier coefficients, 216 profile correlations, 64 Karhunen-Loève coefficients, 240 pixel averages, 47 Zernike moments, and 6 morphological features (^{3}

The data is available at

ICOMP_{PERF}-RFE

Method (Wavelet-Based Method) | Selected Variables from ICOMP_{PERF}-RFE | Number of Variables | Cognition Accuracy | Test Accuracy
---|---|---|---|---
PDCM (MSVET) | 112 | 1 | 100% | 88%
PDCM (VET) | 25 | 1 | 100% | 93%
PDCM (VERTI) | 25 | 1 | 100% | 93%
PDCM (VISU UNION) | 25 | 1 | 100% | 93%
PDCM (VISU INTERSECT) | 25 | 1 | 100% | 93%
SVMC-RFE | 324, 334, 339, 340, 344, 349, 456 | 7 | 100% | 89%
Two-Stage | 8, 10, 11, 15 | 4 | 80% | 80%

Method | Selected Variables from ICOMP_{PERF}-RFE | Number of Variables | Cognition Accuracy | Test Accuracy
---|---|---|---|---
PDCM (MSVET) | 112, 155 | 2 | 100% | 97%
PDCM (VET) | 25 | 1 | 100% | 100%
PDCM (VERTI) | 25 | 1 | 100% | 100%
PDCM (VISU UNION) | 87, 345 | 2 | 100% | 98%
PDCM (VISU INTERSECT) | 24 | 1 | 100% | 100%
SVMC-RFE | 48, 82, 106, 154, 166, 217, 218, 223, 224, 225, 226, 230, 231, 232, 235, 236, 237, 238, 239, 240, 241, 246, 249, 250, 251, 257, 258, 261, 262, 264, 265, 273, 276, 277, 279, 280, 283, 284, 288, 289, 291, 294, 295, 298, 299, 300, 303, 304, 309, 310, 312, 318, 319, 325, 333, 334, 335, 337, 342, 348, 349, 350, 352, 358, 363, 364, 365, 367, 373, 374, 376, 378, 379, 380, 382, 387, 389, 391, 393, 394, 395, 397, 402, 404, 408, 410, 412, 422, 423, 424, 427, 428, 435, 436, 437, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 454, 455, 456 | 109 | 100% | 92%
Two-Stage | 8, 10, 11, 15 | 4 | 70% | 80%

Cauchy and inverse multi-quadratic kernel functions are used in

This article documents the development and application of a novel Perception-Decision-Cognition Methodology (PDCM) for classification analysis based on the MCWT and SVMC with ICOMP_{PERF}-RFE

In this article, the PDCM is applied directly to three real datasets with different characteristics, rather than to simulated datasets. As supported by the numerical experiments documented here, the PDCM outperforms the currently available data mining approaches and, furthermore, proves applicable to various areas, such as bioinformatics, chemometrics, pattern recognition, and other data mining fields. The PDCM has three advantages:

Dimension simplification.

Multiple model choices based on simplified dimension.

Novel rank-based variable selection: ICOMP_{PERF}-RFE

The authors have no funding to report.

The authors have declared that no competing interests exist.

This paper was modified from the Ph.D. dissertation (

The authors have no support to report.