Classification by Tree-based Data Mining Algorithms

Posted by: Prof. J. Biju



Data mining techniques for classification are nowadays prevalent in agriculture. Seed classification involves grouping various seed varieties according to their morphological characteristics. Weka was used to categorize the GW273, GW496, GW322, LOK-1, and GDW1255 wheat varieties typical of the Charotar region (generally comprising the Anand and Kheda districts of Gujarat State, India).

 

The features used are the area, perimeter, solidity, aspect ratio, and major and minor axes of the seed kernel, along with Hue, Saturation, Value, and SF1 (empirical). Feature reduction was done using Information Gain (IG) and its modified version, Gain Ratio (GR). For classification, we used purely tree-based machine learning algorithms, viz. J48, Random Forest, Hoeffding Tree, Logistic Model Tree (LMT), and REPTree. LMT, which builds on logistic regression, gives the highest accuracy of 96.4% among the classifiers compared.

 

The Hoeffding Tree classifier stood second with 96% accuracy. For validation, 10-fold cross-validation was used. Reducing the number of folds in cross-validation decreased the performance of most of the algorithms, with the exception of J48. The percentage of correctly classified instances increased for all algorithms except J48 when features were selected by GR.
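As an illustration of the validation scheme, here is a minimal pure-Python sketch of k-fold splitting. The sample count and seed are hypothetical; the study itself used Weka's built-in cross-validation.

```python
import random

def k_fold_indices(n_samples, k=10, seed=42):
    """Shuffle sample indices and split them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, k=10):
    """Yield (train, test) index pairs: each fold serves once as the test set."""
    folds = k_fold_indices(n_samples, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Over the k rounds, every instance appears in exactly one test fold.
splits = list(cross_validate(100, k=10))
```

Each classifier is trained k times and its accuracy averaged over the k held-out folds, which is why fewer folds (smaller training sets per round) tended to lower performance.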

 

Tree-based Data Mining Algorithms

J48: J48 is the Weka implementation of Quinlan’s C4.5 algorithm, which generates a pruned decision tree. The data are recursively split into smaller subsets using the attribute that yields the highest normalized information gain (gain ratio). The splitting process ends when every instance in a subset belongs to the same class; J48 then creates a leaf node predicting that class. The decision tree built by J48 can even handle varying attribute costs and missing attribute values in the data.
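The split criterion C4.5 uses can be sketched in pure Python as follows. This is an illustrative implementation, not Weka's, and the example attribute values in the usage note are invented.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """C4.5-style gain ratio: information gain normalized by split information."""
    n = len(labels)
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, []).append(y)
    # Information gain: entropy reduction achieved by splitting on this attribute.
    info_gain = entropy(labels) - sum(
        len(sub) / n * entropy(sub) for sub in by_value.values())
    # Split information penalizes attributes with many distinct values.
    split_info = -sum(
        (len(sub) / n) * log2(len(sub) / n) for sub in by_value.values())
    return info_gain / split_info if split_info else 0.0
```

For a perfectly class-separating attribute, e.g. `gain_ratio(['small', 'small', 'large', 'large'], ['A', 'A', 'B', 'B'])`, the gain ratio reaches its maximum of 1.0; an attribute with a single constant value yields 0.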

 

REPTree: The regression tree concept is used to build many trees over several iterations, and the best tree is then selected from among them. Pruning of the tree is based on the mean squared error.

 

Random Forest: This ensemble classifier combines the concepts of ‘bagging’ and random selection of features, building on the approach of Amit and Geman, who proposed constructing a collection of decision trees using controlled variation.
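The two ingredients, bootstrap sampling (bagging) and random feature subsets, can be sketched as below. The sample count, tree count, and feature count here are illustrative only, not the study's settings; picking roughly the square root of the feature count per split is the usual convention.

```python
import random
from math import isqrt

def bootstrap_sample(n_samples, rng):
    """Draw n indices with replacement: the 'bagging' component."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def random_feature_subset(n_features, rng):
    """Pick ~sqrt(n_features) candidate features, as is conventional."""
    k = max(1, isqrt(n_features))
    return rng.sample(range(n_features), k)

rng = random.Random(0)
# Each tree in the forest is grown on its own bootstrap sample and restricted
# to its own random subset of the 9 features; predictions are combined by vote.
trees = [(bootstrap_sample(150, rng), random_feature_subset(9, rng))
         for _ in range(5)]
```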

 

LMT: An accurate and interpretable classifier that combines the benefits of two complementary methods, i.e., decision trees and logistic regression models.

 

Hoeffding Tree: This algorithm is based on a decision tree and uses the Hoeffding bound to derive it. The Hoeffding bound determines how many instances must be seen to reach a particular degree of confidence in a split decision. If the distribution generating the instances remains constant over time, the algorithm can learn from very large (streaming) data.
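The bound itself states that, with probability 1 - delta, the observed mean of n samples of a variable with range R lies within epsilon = sqrt(R^2 * ln(1/delta) / (2n)) of the true mean. A small sketch of both directions of this formula (the numbers in the usage note are illustrative):

```python
from math import ceil, log, sqrt

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the observed mean of
    n samples is within epsilon of the true mean."""
    return sqrt(value_range ** 2 * log(1.0 / delta) / (2.0 * n))

def instances_needed(value_range, delta, epsilon):
    """Smallest n for which the bound shrinks below epsilon, i.e. how many
    instances must be seen before a split decision is trusted."""
    return ceil(value_range ** 2 * log(1.0 / delta) / (2.0 * epsilon ** 2))
```

For example, `instances_needed(1.0, 1e-7, 0.05)` tells a Hoeffding Tree how many instances to observe before the gap between the two best split attributes can be trusted at confidence 1 - 1e-7; the bound shrinks as more instances arrive.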

 

Performance

The performance of each algorithm is compared based on the confusion matrix, which reports correctly and incorrectly classified instances for the given input. It shows the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) values. Metrics such as the Kappa statistic, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Relative Absolute Error (RAE), and Root Relative Squared Error (RRSE) are also used for comparing performance. The True Positive Rate, False Positive Rate, Precision, Recall, F1-score, and Matthews correlation coefficient (MCC) values are taken into consideration for each algorithm, using the standard formulas for these calculations.
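The standard formulas behind these metrics can be sketched for the binary case as follows. Weka's multi-class evaluation averages per-class values; this simplified two-class version is illustrative only.

```python
from math import sqrt

def binary_metrics(tp, tn, fp, fn):
    """Standard classification metrics computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate
    fpr = fp / (fp + tn)             # false positive rate
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Cohen's kappa: observed agreement corrected for chance agreement p_e.
    p_e = (((tp + fp) / total) * ((tp + fn) / total)
           + ((fn + tn) / total) * ((fp + tn) / total))
    kappa = (accuracy - p_e) / (1 - p_e)
    return {"precision": precision, "recall": recall, "fpr": fpr,
            "f1": f1, "mcc": mcc, "accuracy": accuracy, "kappa": kappa}
```

For instance, `binary_metrics(50, 40, 5, 5)` gives an accuracy of 0.9, while a perfect classifier (`fp = fn = 0`) attains 1.0 for F1, MCC, and kappa alike.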

 

Analysis

All tree-based ensemble models are generally driven by three factors: the number of independent variables, the form of the regression line, and the type of estimated variable. Five tree-based models were evaluated, viz. J48, REPTree, Random Forest, LMT, and Hoeffding Tree. With all 9 traits, performance increased as the number of cross-validation folds increased. Classification performance with 7 attributes was slightly lower than with 9 traits. Since the classification accuracy of every algorithm is above 90%, all are good classifiers for sorting the wheat varieties selected for this study. LMT, which builds on logistic regression, is the best of the group, achieving 96.4% accuracy with 10-fold cross-validation and 9 traits.

 

To optimize processing time, feature selection was done using IG with Ranker as the search method. Seven traits were found promising and can play an important role in classification. Using Gain Ratio (GR), a modified version of IG, with Ranker as the search method, 9 attributes, namely Area, Major, Perimeter, Solidity, Equidial, SF1, H, S, and V, were found prominent.

 

Conclusion

Five tree-based classifiers, viz. J48, REPTree, Random Forest, LMT, and Hoeffding Tree, were compared for classifying the GW273, GW496, GW322, LOK-1, and GDW1255 wheat varieties from morphological and color features. All achieved accuracy above 90%, with LMT performing best at 96.4% and the Hoeffding Tree second at 96% under 10-fold cross-validation. Feature selection by Gain Ratio improved the percentage of correctly classified instances for all algorithms except J48.

 

Source

  1. https://www.upgrad.com/blog/common-data-mining-algorithms/
  2. https://www.softwaretestinghelp.com/decision-tree-algorithm-examples-data-mining/
  3. https://www.bitsathy.ac.in/optimization-in-agricultural-engineering/
Categories: Technology