Making the Cut: Feature Selection for the Classification of Political Science Documents
While it is commonly recognized that the availability of digitized text represents an enormous opportunity for political scientists to access and structure new data, relatively little research on text analysis has been tailored to the types of text commonly seen in Political Science. This is especially true of automated document classification, which is used primarily to improve the efficiency of data collection projects, and where the peculiarities of Political Science documents may warrant different methods than those found to be successful on the benchmark sets (e.g., Reuters-21578 and RCV1). In this study, three feature selection methods specifically tailored to improving the efficiency of document classification for Political Science research are compared to a standardized approach. For the purposes of comparison, all other aspects of classification are held constant across evaluations. An analysis of the results yields two findings. First, substantial improvements in precision can be gained by retaining named entities rather than removing them, and minor improvements in precision can be gained by classifying sources separately. Second, feature selection appears to be compatible with ensemble learning, as the highest precision is found in the intersections of the four classified sets.
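The second finding can be illustrated with a minimal sketch, which is not the implementation used in this study. It assumes each of the four classifiers returns the set of document IDs it labels positive, and it treats a document as positive only when all four agree. The document IDs, the relevant set, and the method names (including the placeholder `third_method`) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from functools import reduce


def precision(predicted, relevant):
    """Fraction of predicted-positive documents that are truly relevant."""
    if not predicted:
        return 0.0
    return len(predicted & relevant) / len(predicted)


def intersect_positive_sets(positive_sets):
    """Keep only documents that every classifier labeled positive.

    Expects a non-empty list of sets.
    """
    return reduce(set.intersection, positive_sets)


# Hypothetical toy data, not results from the study.
relevant = {"d1", "d2", "d3", "d4"}
positives_by_method = {
    "standard": {"d1", "d2", "d3", "d5", "d6"},
    "keep_named_entities": {"d1", "d2", "d3", "d4", "d5"},
    "per_source": {"d1", "d2", "d4", "d6"},
    "third_method": {"d1", "d2", "d3", "d7"},  # placeholder name
}

# Precision of each classified set on its own.
individual = {
    name: precision(docs, relevant)
    for name, docs in positives_by_method.items()
}

# Precision of the intersection of all four classified sets.
consensus = intersect_positive_sets(list(positives_by_method.values()))
consensus_precision = precision(consensus, relevant)

for name, value in individual.items():
    print(f"{name}: {value:.2f}")
print(f"intersection: {consensus_precision:.2f}")
```

In this toy example the intersection discards every false positive, because no irrelevant document is selected by all four classifiers. The cost is that relevant documents missed by any single classifier are also discarded, so an intersection can raise precision but can never raise recall.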