Preprocessing method for imbalanced railway train timetable datasets based on attribute correlation analysis and clustering

孔德越; 周姗琪; 朱建生; 闫力斌; 吴颖

doi:10.3969/j.issn.1005-8451.2021.10.01

Imbalanced dataset preprocessing algorithm for train timetable based on correlation analysis and clustering

Abstract

As railway train operation plans are adjusted ever more frequently, train timetable datasets have become large and attribute-rich, with record counts that differ widely between train numbers and attribute values that are highly similar among records of the same train number. Analysis and mining of train timetable data therefore face a dataset imbalance problem. To address it, a preprocessing method for imbalanced railway train timetable datasets based on attribute correlation analysis and clustering is proposed. Guided by a correlation analysis between timetable attributes and a train operation indicator (the passenger load factor, i.e., the percentage of seats occupied per train), the method merges similar records that carry redundant information, lowering their share of the dataset while retaining the main information they contain. This weakens the adverse effect of the imbalanced dataset on subsequent data analysis, reduces the negative impact of excessive similar data on the data analysis models, and helps improve the models' prediction accuracy.
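The abstract does not give the algorithm's details, so the following is only a minimal illustrative sketch of the general idea, not the authors' implementation. It assumes Pearson correlation as the correlation measure, uses the absolute correlation with the load factor as an attribute weight, and merges records of the same train with a simple greedy (leader) clustering under a distance threshold. The field names (`train`, `depart_min`, `stops`, `load_factor`), the threshold value, and the sample records are all hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def attribute_weights(records, attrs, target):
    """Assumed weighting: |correlation| of each attribute with the target."""
    ys = [r[target] for r in records]
    return {a: abs(pearson([r[a] for r in records], ys)) for a in attrs}


def merge_similar(records, attrs, target, key="train", threshold=1.0):
    """Greedy leader clustering per train; each cluster becomes one mean record."""
    weights = attribute_weights(records, attrs, target)
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)

    merged = []
    for train, rows in groups.items():
        clusters = []  # each cluster is a list of records; the first is its leader
        for r in rows:
            for c in clusters:
                lead = c[0]
                # Correlation-weighted Euclidean distance to the cluster leader
                d = sqrt(sum(weights[a] * (r[a] - lead[a]) ** 2 for a in attrs))
                if d <= threshold:
                    c.append(r)
                    break
            else:
                clusters.append([r])  # no close leader found: start a new cluster
        for c in clusters:
            rep = {key: train, "count": len(c)}
            for a in list(attrs) + [target]:
                rep[a] = sum(r[a] for r in c) / len(c)
            merged.append(rep)
    return merged


# Hypothetical sample: train G1 has three near-identical records and one distinct one.
records = [
    {"train": "G1", "depart_min": 480, "stops": 2, "load_factor": 0.90},
    {"train": "G1", "depart_min": 480, "stops": 2, "load_factor": 0.92},
    {"train": "G1", "depart_min": 482, "stops": 2, "load_factor": 0.91},
    {"train": "G1", "depart_min": 540, "stops": 4, "load_factor": 0.70},
    {"train": "K7", "depart_min": 1200, "stops": 9, "load_factor": 0.55},
]
merged = merge_similar(records, ["depart_min", "stops"], "load_factor", threshold=5.0)
for m in merged:
    print(m)
```

In this sketch the three near-identical G1 records collapse into one averaged record that carries a `count` field, so the share of redundant records falls while their mean attribute values are kept. A real pipeline would likely scale the attributes before computing distances, since the threshold here depends on the attributes' units.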
