Multi-Choice Wavelet Thresholding Based Binary Classification Method

Data mining is one of the most effective statistical methodologies for investigating a variety of problems in areas including pattern recognition, machine learning, bioinformatics, chemometrics, and statistics. In particular, statistically sophisticated procedures that emphasize reliability of results and computational efficiency are required for the analysis of high-dimensional data. Optimization principles can play a significant role in the rationalization and validation of specialized data mining procedures. This paper presents a novel three-step methodology based on Multi-Choice Wavelet Thresholding (MCWT). It consists of three processes: perception (dimension reduction), decision (feature ranking), and cognition (model selection). In these steps, three concepts are integrated to evaluate learning models: wavelet thresholding, support vector machines for classification, and information complexity. Three published data sets are used to illustrate the proposed methodology. Additionally, performance comparisons with recent and widely applied methods are shown.

expedite efficiency of data mining; (ii) data mining methodologies enlarge the scope of operations research applications; and (iii) integration of both data mining and operations research boosts system performance. Furthermore, the key element that allows effective fusion of both areas is the use of optimization algorithms (with particular emphasis on search procedures) to find an accurate model, together with the development of metaheuristics. An example of such procedures is the search algorithm by Olafsson et al. (2008) to find the best variable subset.
Classification analysis methods, based on several different types of algorithms, have been proposed to find successful models for complicated data in an extensive range of application domains. The objective of classification analysis is to identify groups of observations based on the input variables so as to minimize the within-group variability and maximize the between-group variability. Recently, classification and other supervised and unsupervised learning areas have faced two challenging issues: (i) the curse of dimensionality; and (ii) nonlinearity. Several researchers have developed new classification analysis techniques to address the curse of dimensionality, including spectral regression discriminant analysis (Cai et al., 2008), automatic non-parameter uncorrelated discriminant analysis (Yang et al., 2008), and high-dimensional discriminant analysis (Bouveyron et al., 2007). Others address nonlinearity, including adaptive nonlinear discriminant analysis (Kim et al., 2006), kernel Fisher discriminant analysis (Mika, 2002), and support vector machines for classification and regression (Vapnik, 1995).
Variable selection is an important area of research in machine learning, pattern recognition, statistics, and related fields. The key idea of variable selection is to find input variables which have predictive information and to eliminate non-informative variables. The use of variable selection techniques is motivated by three reasons: (i) to improve discriminant power; (ii) to find fast and cost-effective variables; and (iii) to reach a better understanding of the application process (Guyon & Elisseeff, 2003). In the case of high-dimensional data, variable selection plays a crucial role because of four challenges (Theodoridis & Koutroumbas, 2006): (i) large set of variables; (ii) existence of irrelevant variables; (iii) presence of redundant variables; and (iv) data noise.
This article proposes a novel methodology that integrates multi-choice wavelet thresholding (MCWT) with a variable selection method for classification to perform three steps known as perception, decision, and cognition. The proposed procedure will be referred to as the Perception-Decision-Cognition Methodology (PDCM). The main idea of this methodology is to provide better classification power by integrating optimal search and data mining procedures. The perception step includes five different wavelet-based dimension reduction methods that transform the original data into a representation that exhibits orthogonality and de-noising. The term multi-choice wavelet thresholding indicates that the proposed method integrates these five dimension reduction methods. In the computational approach, however, five versions of the proposed procedure are compared, and each version uses one of these dimension reduction methods. The decision step uses information complexity to find informative variables which can be used to identify groups based on prior modeling information. The cognition step recognizes the best model based on support vector machines for classification, a well-known kernel-based statistical data mining approach. Ideally, new methodologies are first tested on simulated datasets with different characteristics to check their capabilities. In this article, however, the proposed PDCM is applied directly to three real datasets. Three numerical experiments were run to compare the PDCM with other often-used procedures. The results show that the proposed method generally outperforms the other procedures used in the experiments.
The section "The Proposed PDCM" reviews relevant procedures integrated to design the proposed methodology. These procedures can be classified into the following areas:
a. Wavelet thresholding-based dimension reduction
b. Variable selection (feature ranking)
c. Cognition accuracy (model selection)
The performance of the methodology is tested using three published data sets, and the corresponding results are documented in section "Experimental Results". Section "Discussion" concludes this article. Information on the three real benchmark datasets is presented in section "Supplementary Materials".

Perception-Decision-Cognition Methodology (PDCM)
The proposed Perception-Decision-Cognition Methodology (PDCM) for discriminant analysis is conceptually represented in Figure 1. As indicated in this figure, it consists of three steps:
1. Perceive environmental information.
2. Decide on response (actions).
3. Cognize the accuracy of results to adjust the response.
The algorithm used by the PDCM consists of the three steps conceptually described below, assuming that all data have been partitioned into three sets: a training set, a cognition (validation) set, and a test set.

Step 1: Perceive Sample Space and Data Dimensions
Let the sample data be $X = (x_1, x_2, \ldots, x_q)$ and the corresponding response be $y = (y_1, y_2, \ldots, y_n)^T$, where $q$ is the dimension of $X$ and $n$ is the number of samples. Now apply all available MCWT techniques (VisuShrinkUnion, VisuShrinkIntersect, VertiShrink, VET, and MSVET) to obtain a reduced representation of $X$, where the reduced dimension $p$ is the number of coefficients perceived by the reduction techniques ($p \le q$).

Step 2: Decide on Variables Given Information Complexity
The procedure can be described as follows. Remove each of the $p$ variables one at a time, and evaluate the corresponding information complexity measure, $\mathrm{ICOMP}_{\mathrm{PERF}}$. Once these removal evaluations are completed, the removed variable resulting in the minimum value of $\mathrm{ICOMP}_{\mathrm{PERF}}$ is identified and assigned the lowest rank (i.e., $p$). The procedure is then repeated for the remaining $p - 1$ variables that have no rank yet, so that one of them receives rank $p - 1$. This continues until all $p$ variables have been arranged according to their ranks.
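The ranking loop of Step 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `criterion` is a hypothetical stand-in for the information complexity measure (any function that scores a candidate variable subset, lower being better), and `toy_criterion` with its weights is invented purely to make the example runnable. It is not the criterion of Bozdogan and Baek (2018).

```python
from typing import Callable, List, Sequence


def rank_by_recursive_elimination(
    variables: Sequence[int],
    criterion: Callable[[List[int]], float],
) -> List[int]:
    """Return the variables ordered from rank 1 (kept longest) to rank p."""
    remaining = list(variables)
    eliminated = []  # filled from the lowest rank (p) upward
    while len(remaining) > 1:
        # Score each candidate removal by the criterion of the reduced subset.
        scores = {
            v: criterion([u for u in remaining if u != v]) for v in remaining
        }
        # The variable whose removal gives the minimum criterion value
        # receives the lowest rank still available.
        worst = min(remaining, key=lambda v: scores[v])
        remaining.remove(worst)
        eliminated.append(worst)
    eliminated.extend(remaining)  # the last survivor gets rank 1
    return eliminated[::-1]


# Hypothetical toy criterion: each variable carries an "information" weight,
# and a subset scores lower (better) when it retains more information.
toy_weights = {0: 5.0, 1: 0.1, 2: 2.0}


def toy_criterion(subset: List[int]) -> float:
    return -sum(toy_weights[v] for v in subset)


ranking = rank_by_recursive_elimination([0, 1, 2], toy_criterion)
print(ranking)
```

With these toy weights, variable 1 is eliminated first and variable 0 survives to the end. Each pass evaluates the criterion once per remaining variable, so the full ranking needs on the order of $p^2$ criterion evaluations.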

Step 3: Cognize Accuracy of Selected Models
Compute the accuracy value for each cognition data set using the SVMC for nested subsets of the ranked variables obtained in Step 2. Specifically, first consider the variable with the highest rank (i.e., rank = 1), and calculate the cognition accuracy value. Next, the two variables with rank = 1 and rank = 2 are considered, and a new cognition accuracy value is calculated. This procedure is repeated until all ranked variables are considered. Finally, the subset of variables resulting in the highest accuracy value is chosen as the best model. The two preceding steps, transforming the nonlinear input data with wavelets and finding informative variables with $\mathrm{ICOMP}_{\mathrm{PERF}}$-RFE, are used to obtain more accurate and faster models.
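A minimal sketch of this nested-subset search follows. The name `accuracy_fn` is a hypothetical callable that would train the SVMC on the training set and return the accuracy on the cognition set; here it is replaced by a made-up lookup table so that the example runs on its own. Breaking ties in favor of the smaller subset is an assumption of this sketch, not something stated in the text.

```python
from typing import Callable, List, Sequence, Tuple


def select_best_nested_subset(
    ranked: Sequence[int],
    accuracy_fn: Callable[[List[int]], float],
) -> Tuple[List[int], float]:
    """Evaluate the top-1, top-2, ..., top-p variables and keep the best subset."""
    best_subset: List[int] = []
    best_accuracy = float("-inf")
    for k in range(1, len(ranked) + 1):
        subset = list(ranked[:k])
        accuracy = accuracy_fn(subset)
        # Strict inequality: on ties the smaller subset is retained.
        if accuracy > best_accuracy:
            best_subset, best_accuracy = subset, accuracy
    return best_subset, best_accuracy


# Hypothetical cognition accuracies indexed by subset size (illustration only).
toy_accuracy_by_size = {1: 0.70, 2: 0.90, 3: 0.90, 4: 0.85}


def toy_accuracy(subset: List[int]) -> float:
    return toy_accuracy_by_size[len(subset)]


best_subset, best_accuracy = select_best_nested_subset([7, 3, 9, 1], toy_accuracy)
print(best_subset, best_accuracy)
```

Only $p$ models are fitted, one per prefix of the ranking, rather than one per possible subset of the $p$ variables.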

Figure 1
Conceptual View of the Perception-Decision-Cognition Methodology (PDCM)

Step 1: Wavelet Thresholding-Based Dimension Reduction Techniques
Dimension reduction is a preferred strategy in the area of machine learning, and there are several approaches to perform it. The following methods are among the most popular: principal component analysis (Jolliffe, 2002), rotational linear discriminant analysis technique (Sharma & Paliwal, 2008), independent component analysis (Stone, 2004), semi-definite embedding (Weinberger & Saul, 2006), multifactor dimensionality reduction (Ritchie & Motsinger, 2005), factor analysis (Basilevsky, 1994), and wavelet-based dimension reduction (Chang & Vidakovic, 2002; Cho et al., 2009; Donoho & Johnstone, 1994; Jung et al., 2006). The dimension reduction strategy has important benefits that can be measured not only in terms of computational time savings, but also in accuracy improvement. In the novel PDCM, wavelet-based dimension reduction is applied in Step 1. The wavelet approach was selected because of several attractive attributes, among which the following two are most relevant: (a) wavelets adapt effectively to spatial features of a function such as discontinuities and varying frequency behavior; (b) wavelets have efficient O(n) algorithms to do transformations (Mallat, 1999). Based on perceived knowledge, wavelet-based techniques are applied to obtain a well-fitted reduced-dimension representation of the original data.
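To make the O(n) claim concrete, the following sketch implements a full-depth orthonormal Haar transform in plain Python. The Haar wavelet is chosen here only because it is the simplest case; the text does not state which wavelet family was used in the experiments, so this choice is an assumption of the example.

```python
import math


def haar_dwt(signal):
    """Full-depth orthonormal Haar DWT of a sequence whose length is a power of two.

    Returns [coarsest approximation, coarsest detail, ..., finest details].
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    scale = 1.0 / math.sqrt(2.0)
    approx = [float(v) for v in signal]
    detail_levels = []  # finest level first
    while len(approx) > 1:
        half = len(approx) // 2
        # One pass over the current approximation: pairwise sums and differences.
        sums = [(approx[2 * i] + approx[2 * i + 1]) * scale for i in range(half)]
        diffs = [(approx[2 * i] - approx[2 * i + 1]) * scale for i in range(half)]
        detail_levels.append(diffs)
        approx = sums
    coefficients = approx
    for diffs in reversed(detail_levels):  # coarse levels before fine levels
        coefficients.extend(diffs)
    return coefficients


coeffs = haar_dwt([4.0, 2.0, 5.0, 5.0])
print(coeffs)
```

Each level processes half as many values as the previous one, so the total work is proportional to n/2 + n/4 + ... < n. Because the transform is orthonormal, the sum of squared coefficients equals the sum of squared input values, which is what makes energy-based thresholding in the wavelet domain meaningful.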
The discrete wavelet transformation (DWT) is often used for dimension reduction (also known as shrinkage or thresholding). The data constructed with the scaling and wavelet functions, based on an orthogonal basis in the time domain, is as follows:

$$f(t) = \sum_{k \in Z} c_{L,k}\,\phi_{L,k}(t) + \sum_{j \ge L} \sum_{k \in Z} d_{j,k}\,\psi_{j,k}(t),$$

where $Z$ is the set of all integers, $\phi_{L,k}$ and $\psi_{j,k}$ are the scaling and wavelet functions, $c_{L,k}$ is the coarse-level coefficient, and $d_{j,k}$ is the finer-level coefficient. Let $y_m = [y_{m1}, y_{m2}, \cdots, y_{mN}]^T$ be the $m$th observed sample.
For a single sample, the DWT procedure uses the orthonormal matrix $W$ of dimension $N \times N$ to find the wavelet coefficients

$$d_m = W y_m = (c_L, d_L, d_{L+1}, \ldots, d_J)^T,$$

where $J > L$ and $L$ corresponds to the lowest decomposition level. Small absolute values of wavelet coefficients are undesirable since they may be influenced more by noise than by information. On the other hand, large absolute values are more influenced by information than by noise. This observation motivates the development of threshold methods. There are two threshold rules, usually referred to as the soft and hard thresholds. The soft rule is a continuous function of the data that shrinks each observation, while the hard rule retains only large observations and leaves them unchanged (Donoho & Johnstone, 1994). The hard and soft threshold methods are defined as follows:

$$d^{hard} = d \cdot I(|d| > \lambda), \qquad d^{soft} = \mathrm{sign}(d)\,(|d| - \lambda)_+,$$

where $\lambda$ is the threshold value, $I(\cdot)$ is the indicator function, and $(x)_+ = \max(x, 0)$. The threshold method can be used not only for data reduction but also for de-noising. Five different wavelet-based thresholding methods are available to handle high-dimensional data, and one of them is selected based on the performance of the PDCM (hence multi-choice). Therefore, the dimension reduction method is referred to as Multi-Choice Wavelet Thresholding (MCWT).
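The two threshold rules can be illustrated with a short Python sketch, assuming the standard definitions of Donoho and Johnstone (1994): the hard rule keeps a coefficient unchanged when its absolute value exceeds $\lambda$ and sets it to zero otherwise, while the soft rule additionally shrinks every surviving coefficient toward zero by $\lambda$. The function names and sample values are illustrative only.

```python
import math


def hard_threshold(coefficients, lam):
    """Keep coefficients with |d| > lam unchanged; set the rest to zero."""
    return [d if abs(d) > lam else 0.0 for d in coefficients]


def soft_threshold(coefficients, lam):
    """Shrink every coefficient toward zero by lam; small ones become zero."""
    return [math.copysign(max(abs(d) - lam, 0.0), d) for d in coefficients]


sample = [3.0, -0.5, 1.5, -2.5]
hard = hard_threshold(sample, 1.0)
soft = soft_threshold(sample, 1.0)
print(hard)
print(soft)
```

Counting the nonzero entries after thresholding gives the reduced dimension: both rules keep the same positions, and they differ only in whether the retained values are shrunk.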

VisuShrink (VS)
VisuShrink is a soft thresholding technique that applies the universal threshold proposed by Donoho and Johnstone (1994). The VisuShrink threshold is given by $\sigma\sqrt{2 \log N}$, where $N$ is the number of wavelet coefficients and $\sigma$ is the standard deviation of the wavelet coefficients (or the noise standard deviation). When $\varepsilon_i$ is a white noise sequence, independent and identically distributed as $N(0,1)$, then as $N \to \infty$, $P(\max_i |\varepsilon_i| > \sqrt{2 \log N}) \to 0$. That is, the maximum of the $N$ values will most likely be smaller than the universal threshold. VisuShrink therefore yields a reconstruction that is noise-free with high probability. However, because the threshold is large, the degree of data fitting may be unsatisfactory. For multiple curves or samples, the VS procedure uses the union (VisuShrinkUnion, VSU) or intersection (VisuShrinkIntersect, VSI) of data sets in the selection of wavelet coefficients (Jung et al., 2006).
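The sketch below computes the universal threshold and then combines the per-curve selections by union and by intersection. It assumes that "union" means keeping a coefficient position if it survives the threshold for at least one curve, and "intersection" means keeping it only if it survives for every curve. That reading, the function names, and the toy coefficients are assumptions of this example and should be checked against Jung et al. (2006).

```python
import math


def universal_threshold(sigma, n_coefficients):
    """Universal threshold sigma * sqrt(2 * log(N))."""
    return sigma * math.sqrt(2.0 * math.log(n_coefficients))


def kept_positions(coefficients, lam):
    """Positions whose coefficient magnitude exceeds the threshold."""
    return {j for j, d in enumerate(coefficients) if abs(d) > lam}


def visu_union(curves, sigma):
    lam = universal_threshold(sigma, len(curves[0]))
    return set.union(*(kept_positions(c, lam) for c in curves))


def visu_intersect(curves, sigma):
    lam = universal_threshold(sigma, len(curves[0]))
    return set.intersection(*(kept_positions(c, lam) for c in curves))


# Two toy curves, already in the wavelet domain, with unit noise level.
toy_curves = [[3.0, 0.2, 2.0, 0.1], [2.5, 1.9, 0.3, 0.0]]
union_positions = visu_union(toy_curves, sigma=1.0)
intersect_positions = visu_intersect(toy_curves, sigma=1.0)
print(sorted(union_positions), sorted(intersect_positions))
```

The union never selects fewer positions than the intersection, so under this reading VSU gives the larger reduced dimension $p$ and VSI the smaller one.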

VertiShrink (VERTI)
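Chang and Vidakovic (2002) developed a Stein-type shrinkage method, known as VertiShrink, to maximize the predictive density under appropriate model assumptions regarding wavelet coefficients. The main goal of VertiShrink is the estimation of the baseline curve by using the average of block vertical coefficients. The estimated wavelet coefficients are given by

$$\hat{\theta}_j = \left(1 - \frac{M\sigma^2}{\|d_{vj}\|^2}\right)_+ \bar{d}_j,$$

where $\|d_{vj}\|^2 = \sum_{m=1}^{M} d_{mj}^2$ and $\bar{d}_j = \frac{1}{M}\sum_{m=1}^{M} d_{mj}$ are the sum of squares and the average of the coefficients at the $j$th wavelet position, $M$ is the number of curves, and $\sigma$ is the standard deviation of the wavelet coefficients.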

Vertical-Energy-Thresholding (VET)
VET was proposed by Jung et al. (2006). The procedure is based on the observation that the energy of a function with some smoothness is often concentrated in a few coefficients, while the energy of noise is spread over all coefficients in the wavelet domain. The vertical energy of the wavelet coefficients at the $j$th position is defined by

$$\|d_{vj}\|^2 = d_{1j}^2 + d_{2j}^2 + \cdots + d_{Mj}^2,$$

where $d_{mj}$ is the wavelet coefficient at the $j$th wavelet position for the $m$th data curve, $m = 1, 2, \ldots, M$.
The VET method determines the threshold value $\lambda$ by minimizing the overall relative reconstruction error (ORRE), whose expression is given by Jung et al. (2006).
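As a rough illustration of the vertical energy idea, the sketch below sums squared coefficients across curves at each wavelet position and keeps the positions whose energy exceeds a given threshold. It assumes that the vertical energy at position $j$ is the sum of $d_{mj}^2$ over the $M$ curves. The threshold is simply passed in as an argument, because the ORRE minimization that VET uses to choose $\lambda$ is not reproduced here, and the function names and toy values are hypothetical.

```python
def vertical_energy(curves):
    """Sum of squared wavelet coefficients across curves at each position."""
    n_positions = len(curves[0])
    return [sum(curve[j] ** 2 for curve in curves) for j in range(n_positions)]


def select_by_vertical_energy(curves, lam):
    """Positions whose vertical energy exceeds a user-supplied threshold."""
    return [j for j, energy in enumerate(vertical_energy(curves)) if energy > lam]


# Two toy curves in the wavelet domain (M = 2 curves, 3 positions).
toy_curves = [[3.0, 0.5, 0.0], [4.0, 0.5, 1.0]]
energies = vertical_energy(toy_curves)
selected = select_by_vertical_energy(toy_curves, lam=0.75)
print(energies, selected)
```

Unlike the per-curve VisuShrink rules, a single set of positions is selected for all curves at once, which keeps the reduced variables aligned across samples.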

MultiScale-Vertical-Energy-Thresholding (MSVET)
Since the VET procedure does not consider the scale information of wavelets, Cho et al. (2009) proposed an improved procedure known as multi-scale vertical energy thresholding (MSVET), which extends the idea of VET to obtain a different optimal threshold value for each scale. In the MSVET procedure, the threshold values are determined by minimizing the multi-scale overall relative reconstruction error (MSORRE), whose expression is given by Cho et al. (2009).

Step 2: Variable Selection Based on Information Complexity and Recursive Feature Elimination
Once the reduced sample space is determined in Step 1, the decision regarding which of the remaining variables should be selected for ranking is made on the basis of minimal information complexity values, following the Information Complexity Performance Testing with Recursive Feature Elimination ($\mathrm{ICOMP}_{\mathrm{PERF}}$-RFE) procedure proposed by Bozdogan and Baek (2018). This procedure was chosen because it resulted in better performance than other RFE-based methods. It essentially generates a stabilized and smoothed covariance estimator to calculate the information complexity measure and then performs ranking using recursive elimination on the remaining variables. Information complexity for discriminant analysis is evaluated using the modified maximal entropic complexity

$$C_{1F}(\Sigma) = \frac{1}{4\bar{\lambda}_a^2} \sum_{j=1}^{s} (\lambda_j - \bar{\lambda}_a)^2,$$

where $s$ is the rank of $\Sigma$, $\lambda_j$ is the $j$th eigenvalue of $\Sigma > 0$, $j = 1, 2, \cdots, s$, and $\bar{\lambda}_a$ is the arithmetic mean of the eigenvalues. In the expression for $\mathrm{ICOMP}_{\mathrm{PERF}}$, lack of fit is assessed by means of the first three terms and complexity by the fourth one. The expression involves $\hat{\sigma}^2$, the estimated mean squared error, and $\hat{\Sigma}_{STA\_CSE}$, the stabilized and smoothed convex sum covariance matrix estimator. This estimator is constructed from $\hat{\Sigma}_{STA}$, the stabilized covariance matrix proposed by Thomaz (2004), the number of variables $h$, the $h \times h$ identity matrix $I_h$, and a suitably chosen constant $k$. Specific details on this procedure, including the explicit expressions, are provided by Bozdogan and Baek (2018).
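The complexity term can be illustrated numerically. The sketch below assumes the form $C_{1F}(\Sigma) = \frac{1}{4\bar{\lambda}_a^2}\sum_{j=1}^{s}(\lambda_j - \bar{\lambda}_a)^2$, which is consistent with the symbols defined in the text but should be checked against Bozdogan and Baek (2018). It takes the eigenvalues as given rather than computing them from a covariance matrix, and the function name is hypothetical.

```python
def c1f_complexity(eigenvalues):
    """Assumed form of C_1F: squared spread of the eigenvalues around their
    arithmetic mean, scaled by 1 / (4 * mean^2)."""
    s = len(eigenvalues)
    if s == 0:
        raise ValueError("at least one eigenvalue is required")
    mean = sum(eigenvalues) / s
    if mean <= 0:
        raise ValueError("eigenvalues of a positive definite matrix have a positive mean")
    return sum((lam - mean) ** 2 for lam in eigenvalues) / (4.0 * mean ** 2)


equal_spectrum = c1f_complexity([2.0, 2.0, 2.0])
spread_spectrum = c1f_complexity([1.0, 3.0])
print(equal_spectrum, spread_spectrum)
```

Under this form the measure is zero when all eigenvalues are equal and grows as the eigenvalues spread apart, so strongly correlated variable sets are penalized more heavily than nearly orthogonal ones.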

Step 3: Cognition Accuracy of Selected Models
When the ranking decision is finished in Step 2, the corresponding accuracies are determined using the corresponding cognition sets and the support vector machines for classification (SVMC) described below. Once the accuracies are calculated for the selected models, the most accurate one is chosen. The SVMC finds an optimal separating hyperplane that maximizes the margin between the classes (Vapnik, 1995). Consider the case of classifying a set of linearly separable data into two groups. Assume a set of training data is given by $[(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)]$, where $x_i \in \Re^p$ is an input vector, $y_i \in \{-1, 1\}$ is a binary class index, and $n$ is the size of the training data set. Then, a decision boundary that partitions the underlying vector space into two classes can be represented by the hyperplane

$$w^T x + b = 0,$$

where $w$ is the weight vector and $b$ is the bias. The objective of SVMC is to find a maximum margin decision boundary between two parallel hyperplanes, $w^T x + b = 1$ and $w^T x + b = -1$. In kernel form, this leads to the dual problem

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \; 0 \le \alpha_i \le C,$$

where $K(x_i, x_j)$ is the kernel function and $C$ is a predefined coefficient. Kernel functions used in the numerical experiments are described in Table 1.

Table 1
Kernel Functions

The point $x_o$ with coordinates corresponding to new data can be classified as indicated below:

$$\text{Class 1: } \sum_{i=1}^{n} \alpha_i^{ov} y_i K(x_i, x_o) + b^{ov} > 0, \qquad \text{Class 2: } \sum_{i=1}^{n} \alpha_i^{ov} y_i K(x_i, x_o) + b^{ov} < 0,$$

where $\alpha^{ov}$ and $b^{ov}$ are optimal values found based on the training data. A classification example based on the PDCM is illustrated in Figure 2.
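The classification rule for a new point can be sketched as follows. The kernel forms are assumptions, since the contents of Table 1 are not reproduced here: the Cauchy kernel is taken as $1 / (1 + \|x - z\|^2 / \sigma^2)$ and the inverse multi-quadratic (often written inverse multiquadric) kernel as $1 / \sqrt{\|x - z\|^2 + c^2}$, which are commonly used definitions. The multipliers, labels, and bias below are made-up values, not the result of solving the SVMC optimization, and mapping a positive decision value to Class 1 is also an assumption.

```python
import math


def squared_distance(x, z):
    return sum((a - b) ** 2 for a, b in zip(x, z))


def cauchy_kernel(x, z, sigma=1.0):
    """Assumed Cauchy kernel: 1 / (1 + ||x - z||^2 / sigma^2)."""
    return 1.0 / (1.0 + squared_distance(x, z) / sigma ** 2)


def inverse_multiquadric_kernel(x, z, c=1.0):
    """Assumed inverse multi-quadratic kernel: 1 / sqrt(||x - z||^2 + c^2)."""
    return 1.0 / math.sqrt(squared_distance(x, z) + c ** 2)


def decision_value(x_new, support_vectors, labels, alphas, bias, kernel):
    """Weighted kernel sum over the support vectors plus the bias."""
    total = sum(
        alpha * y * kernel(sv, x_new)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return total + bias


def classify(x_new, support_vectors, labels, alphas, bias, kernel):
    """Return 1 for a non-negative decision value and 2 otherwise."""
    value = decision_value(x_new, support_vectors, labels, alphas, bias, kernel)
    return 1 if value >= 0 else 2


# Hypothetical fitted model with two support vectors.
svs = [[0.0, 0.0], [2.0, 0.0]]
ys = [1, -1]
alphas = [1.0, 1.0]
near_first = classify([0.0, 0.0], svs, ys, alphas, 0.0, cauchy_kernel)
near_second = classify([2.0, 0.0], svs, ys, alphas, 0.0, cauchy_kernel)
print(near_first, near_second)
```

Only the support vectors (training points with nonzero multipliers) enter the sum, so prediction cost depends on their number rather than on the size of the whole training set.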

Figure 2
Classification Example Based on PDCM with SVMC

Experimental Results
In order to emphasize the effectiveness of the PDCM, it is applied to three different data sets (Heart, Fat, and Handwritten), which cover both the low-dimension, high-sample-size setting and the high-dimension, low-sample-size setting. Each data set is divided into three sets: training (50% of the total set), cognition (10% of the total set), and test (40% of the total set).
For the wavelet transformation of the three data sets, the linear padding suggested by Strang and Nguyen (1997) is applied. MATLAB was used to obtain the experimental results because it offers convenient matrix and vector operations, plotting, and toolbox functions. This article documents the comparison of the PDCM with the following procedures:
a. SVMC recursive feature elimination (SVMC-RFE) (Guyon et al., 2002): SVMC-RFE uses the weight as a criterion to rank each feature, using support vector machines for classification and the recursive feature elimination algorithm.
b. Two-stage method (Cho et al., 2009): The two-stage method uses multi-scale vertical energy thresholding (MSVET) to reduce the dimension and applies support vector machines for classification with recursive feature elimination to select important wavelet coefficients based on the gradient.

Heart Data (44 Variables)
The data set includes 267 samples and 44 variables on cardiac single photon emission computed tomography (SPECT) images with two categories, i.e., normal and abnormal. There are 55 normal and 212 abnormal samples. The data set is divided into 134 samples as a training set, 13 samples as a cognition set, and 120 samples as a test set. The training set has 25 normal and 109 abnormal samples. The cognition set has 1 normal and 12 abnormal samples. The test set has 29 normal and 91 abnormal samples. Table 2 and Table 3 show comparison results in terms of the variables selected by $\mathrm{ICOMP}_{\mathrm{PERF}}$-RFE, the cognition accuracy, and the test accuracy. The Cauchy and inverse multi-quadratic kernel functions are used in Table 2 and Table 3, respectively. As observed in Table 2, PDCM achieves better cognition accuracies than SVMC-RFE and two-stage and yields more accurate results on the test set. Also, as shown in Table 3, PDCM (visu intersect) and two-stage both reach the highest test accuracy, although two-stage requires fewer variables.

Handwritten Data (240 Variables)
This data set has features of handwritten numerals from 0 to 9 extracted from a collection of Dutch utility maps. The entire set consists of 200 samples per numeral, digitized as binary images, and six different variable sets: 76 Fourier coefficients, 216 profile correlations, 64 Karhunen-Loève coefficients, 240 pixel averages, 47 Zernike moments, and 6 morphological features (Van Breukelen et al., 1998). Two of the six variable sets (216 profile correlations and 240 pixel averages) are used for the experiment. Two (0 and 9) of the 10 numerals are selected to verify the proposed method. For each selected numeral, 100 samples are included in the experimental data set. For the experiment, 100 samples are used as a training set, 10 samples as a cognition set, and 90 samples as a test set. The training set has 45 zero-numeral and 55 nine-numeral samples. The cognition set has 4 zero-numeral and 6 nine-numeral samples. The test set has 51 zero-numeral and 39 nine-numeral samples.³ Table 6 and Table 7 show comparison results in terms of the variables selected by $\mathrm{ICOMP}_{\mathrm{PERF}}$-RFE, the cognition accuracy, and the test accuracy. The Cauchy and inverse multi-quadratic kernel functions are used in Table 6 and Table 7, respectively. As shown in the tables, PDCM reaches 100% cognition accuracy, as does SVMC-RFE, whereas two-stage does not. Furthermore, PDCM achieves a higher accuracy level than the other methods for the test set, except PDCM (msvet) with the Cauchy kernel.
3) The data is available at https://archive.ics.uci.edu/ml/datasets/Multiple+Features.
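Selected variables: 82, 106, 154, 166, 217, 218, 223, 224, 225, 226, 230, 231, 232, 235, 236, 237, 238, 239, 240, 241, 246, 249, 250, 251, 257, 258, 261, 262, 264, 265, 273, 276, 277, 279, 280, 283, 284, 288, 289, 291, 294, 295, 298, 299, 300, 303, 304, 309, 310, 312, 318, 319, 325, 333, 334, 335, 337, 342, 348, 349, 350, 352, 358, 363, 364, 365, 367, 373, 374, 376, 378, 379, 380, 382, 387, 389, 391, 393, 394, 395, 397, 402, 404, 408, 410, 412, 422, 423, 424, 427, 428, 435, 436, 437, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 454, 455, 456. Number of variables: 109. Cognition accuracy: 100%. Test accuracy: 92%.
Two-Stage. Selected variables: 8, 10, 11, 15. Number of variables: 4. Cognition accuracy: 70%. Test accuracy: 80%.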

Discussion
This article documents the development and application of a novel Perception-Decision-Cognition Methodology (PDCM) for classification analysis based on the MCWT and the SVMC with $\mathrm{ICOMP}_{\mathrm{PERF}}$-RFE. Five different wavelet-based dimension reduction techniques, collectively called MCWT, are applied in the perception step. It is shown that the procedure yields a good representation of the original data using only a reduced set of variables. The decision step is performed using a rank-based variable selection approach based on the information complexity criterion. This approach shows a good ability to achieve reasonable variable ranks, which in turn can affect decision making. In the cognition step, the number of variables and the accuracy are cognized for further discrimination.