Applying Data Mining to China’s Swine Farming Industry: A Compromise Perspective of Economic, Environmental and Overall Performances
The economic and environmental performances of the swine farming industry have always resulted in heated discussions in developing countries. Exploring the relationship between these features and the producers' overall performance is the focus of this paper. For constructing multi-objective features that include the above features, a compromise approach for optimization is taken into consideration. For classifying the overall performance into different levels and detecting the effect of economic and environmental features on such features, an iteration scheme is developed in which the overall performance is treated as a target label. By neglecting this target label, a k-means clustering method is then used to help predict the producer's overall performance given their economic and environmental features. In data pre-processing, correlation analysis for feature selection shows that the producer's pollution emission and received regulation intensity largely affect its overall performance, while profit is found to be negatively correlated with pollution emission as regulation intensity is neglected. The classification result derived from the Silhouette Coefficient shows that the data set can be efficiently split into different groups in terms of the producer's overall performance. The average distance between the objects in the low-performance group is larger than that of the high-performance group. The threshold position between the two groups is found to be largely dependent on the features of pollution emission and regulation intensity. The clustering result obtained by the k-means method shows good effectiveness and efficiency in separating the objects into different groups based on various features other than the overall performance. In 2-and 3-cluster cases, the result also shows evidence of the impact of economic and environmental features on the clustering result. The cross-validation analysis under a set of randomly chosen splitting points shows an increasing out-of-sample prediction quality with increases in training sample size. As one of the by-products of this paper, the geographical distribution in the clustering result is found partially consistent with the official report from Chinas central government regarding advantageous regions within the industry. In addition to current research, the ease of using the knowledge obtained in this paper for transfer learning is discussed. © 2018 by the authors.