Abstract:
With railway train operation plans being adjusted frequently, train timetable data sets are characterized by a large volume of data, a large number of attributes, large differences in the number of timetable records across trains, and similar attribute values among the timetable records of the same train. Train timetable data analysis and mining therefore face the problem of imbalanced data sets. To address this, an imbalanced-dataset preprocessing algorithm for train timetables based on correlation analysis and clustering is proposed. Based on a correlation analysis between train timetable attributes and a train operation index (i.e., the percentage of passenger seat utilization per train), the algorithm can effectively merge similar data records that contain redundant information, reducing the proportion of such records in the data set while retaining the main information they contain. This weakens the negative effects of imbalanced train timetable data sets on subsequent data analysis, reduces the adverse impact of excessive similar data on the performance of data analysis models, and helps improve the models' prediction accuracy.
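
As a rough illustration of the idea summarized in the abstract, the following minimal sketch first keeps only the attributes whose Pearson correlation with the operation index is strong enough, and then merges near-identical records of the same train into one averaged representative. The abstract does not specify the correlation measure, the clustering method, or any thresholds, so everything here is an assumption made for illustration: the function names, the field names passed in by the caller, the parameters `min_abs_r` and `radius`, and the use of a simple greedy leader clustering in place of whatever clustering the algorithm actually uses.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_attributes(records, attributes, index_key, min_abs_r=0.3):
    """Keep attributes whose |r| with the operation index reaches min_abs_r.

    min_abs_r is a hypothetical threshold, not a value from the paper.
    """
    target = [r[index_key] for r in records]
    return [
        a for a in attributes
        if abs(pearson([r[a] for r in records], target)) >= min_abs_r
    ]


def merge_similar(records, train_key, attributes, index_key, radius=1.0):
    """Merge similar records of the same train into averaged representatives.

    Uses greedy leader clustering (an assumed stand-in): a record joins the
    first cluster whose leader lies within `radius` (Euclidean distance over
    the selected attributes); otherwise it starts a new cluster.
    """
    by_train = defaultdict(list)
    for r in records:
        by_train[r[train_key]].append(r)

    merged = []
    for train, rows in by_train.items():
        clusters = []  # each cluster is a list of rows; rows[0] is the leader
        for row in rows:
            for cluster in clusters:
                leader = cluster[0]
                dist = sqrt(sum((row[a] - leader[a]) ** 2 for a in attributes))
                if dist <= radius:
                    cluster.append(row)
                    break
            else:
                clusters.append([row])

        for cluster in clusters:
            # The representative keeps the mean of each selected attribute
            # and of the operation index, plus the number of merged records.
            rep = {train_key: train, "n_merged": len(cluster)}
            for key in list(attributes) + [index_key]:
                rep[key] = sum(r[key] for r in cluster) / len(cluster)
            merged.append(rep)
    return merged
```

In this sketch, a train with many near-duplicate records contributes only a few representatives after merging, which is one simple way to reduce the share of redundant records while keeping their average attribute values. In practice the attributes would likely need scaling before a single distance radius is meaningful.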