Important Citations Identification with Semi-Supervised Classification Model

Citation information

Xin An, Xin Sun, and Shuo Xu, 2022. Important Citations Identification with Semi-Supervised Classification Model. Scientometrics, Vol. 127, No. 11, pp. 6533-6555.

Requirements

- Python 3
- scikit-learn (sklearn)
- matplotlib, pandas, numpy, nltk, re, xlrd, xlwt

Datasets

Dataset I

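Dataset I comes from Valenzuela et al. (2015). The collection and pre-processing steps are the same as in An et al. (2021) (Note). In total, 456 labeled citing-cited paper pairs and 8,085 unlabeled citing-cited paper pairs were collected.

The raw data and annotations are in data_for_semi/Valenzuela_information.xlsx;
the labeled data used in this experiment are in data_for_semi/Valenzuela_data.csv;
the unlabeled experimental data are in data_for_semi/Valenzuela_unlabeled_data.csv;
the semi-supervised predictions at each threshold (95%-70%) are stored in separate sheets of data_for_semi/Valenzuela_RF_predict.xlsx and Valenzuela_SVM_predict.xlsx.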

Dataset II

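Dataset II comes from Zhu et al. (2015). The collection and pre-processing steps are the same as in An et al. (2021) (Note). In total, 112 citing papers and 2,685 labeled citing-cited paper pairs were collected. The citing papers span 10 disciplines, and 82 of them come from Computer Science.

The raw data and annotations are in ./data_for_semi/Zhu_information.xls;
the data used in this experiment are in ./data_for_semi/Zhu_data.xlsx, where the last column gives the discipline.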

Pre-processing

The pre-processing steps are the same as those in An et al. (2021).

Feature Engineering

The six groups of features from An et al. (2021) are used here.

G1: CIM features; G2: structural features; G3: separate citation features; G4: author overlap features; G5: cue word features; G6: similarity features.

Experiment I

The following experiments are carried out on Dataset I.

Supervised Learning

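Using the labeled data Valenzuela_data.csv, run gridsearch.py. All experiments below apply GridSearch with five-fold cross-validation to tune the parameters of the SVM and RF models.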

> python ./gridsearch.py
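The contents of gridsearch.py are not reproduced on this page. As a rough sketch of what a five-fold grid search over SVM and RF can look like with scikit-learn, the example below uses synthetic data in place of Valenzuela_data.csv. The parameter grids and the scoring metric are assumptions chosen for illustration, not the values used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 14 features and binary labels in Valenzuela_data.csv.
X, y = make_classification(n_samples=200, n_features=14,
                           weights=[0.85, 0.15], random_state=0)

# Hypothetical search grids; the grids in gridsearch.py may differ.
candidates = {
    "SVM": (
        make_pipeline(StandardScaler(), SVC(random_state=0)),
        {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]},
    ),
    "RF": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
}

# Five stratified folds, so every fold keeps the class imbalance of the full set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
best_params = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, scoring="average_precision", cv=cv)
    search.fit(X, y)
    best_params[name] = search.best_params_
    print(name, search.best_params_, round(search.best_score_, 3))
```

Scaling is placed inside the SVM pipeline so that it is refit on the training part of each fold.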

Run Valenzuela_supervised_14features.py to plot the PR and ROC performance curves of supervised SVM and RF models on the 14 features of the Valenzuela dataset.

> python ./Valenzuela_supervised_14features.py
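For readers who want to see how PR and ROC curves and their areas can be obtained, here is a minimal sketch with scikit-learn on synthetic data. It pools out-of-fold predictions into one curve, which is an assumption; the script itself may average per-fold curves instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, precision_recall_curve, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the 14-feature labeled data.
X, y = make_classification(n_samples=200, n_features=14,
                           weights=[0.85, 0.15], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Out-of-fold probability of the positive ("important") class for every sample.
scores = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]

# Curve points and the area under each curve.
precision, recall, _ = precision_recall_curve(y, scores)
fpr, tpr, _ = roc_curve(y, scores)
auc_pr = auc(recall, precision)
auc_roc = auc(fpr, tpr)
print(f"AUC-PR={auc_pr:.3f}  AUC-ROC={auc_roc:.3f}")
```

The arrays (recall, precision) and (fpr, tpr) can be passed directly to matplotlib's plot function to draw the two curves.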

Semi-Supervised Learning

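Using the labeled data Valenzuela_data.csv and the unlabeled data Valenzuela_unlabeled_data.csv, run semi_supervised.py. Split the labeled data into 5 folds, adjust the corresponding parameters, and run the code on each fold at thresholds of 0.95, 0.90, 0.85, 0.80, 0.75 and 0.70 to obtain the average PR and ROC of SVM and RF at each threshold. Every experiment uses gridsearch.py to tune the model parameters.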

> python ./semi_supervised.py
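The internals of semi_supervised.py are not shown on this page. The sketch below illustrates one common form of threshold-based pseudo-labeling, assuming a single round: fit on the labeled training folds, keep the unlabeled samples whose predicted class probability reaches the threshold, refit on the enlarged set, and evaluate on the held-out fold. The helper pseudo_label_fit, the synthetic data, and the use of average precision as the PR summary are all illustrative assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold


def pseudo_label_fit(base, X_lab, y_lab, X_unlab, threshold):
    """Fit on labeled data, pseudo-label confident unlabeled rows, refit once."""
    model = clone(base).fit(X_lab, y_lab)
    proba = model.predict_proba(X_unlab)
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        return model, 0  # nothing cleared the threshold; keep the supervised model
    pseudo_y = model.classes_[proba[confident].argmax(axis=1)]
    X_aug = np.vstack([X_lab, X_unlab[confident]])
    y_aug = np.concatenate([y_lab, pseudo_y])
    return clone(base).fit(X_aug, y_aug), int(confident.sum())


# Synthetic stand-ins for the labeled and unlabeled CSV files.
X_all, y_all = make_classification(n_samples=600, n_features=14,
                                   weights=[0.85, 0.15], random_state=0)
X_lab, y_lab, X_unlab = X_all[:200], y_all[:200], X_all[200:]

base = RandomForestClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for threshold in (0.95, 0.90, 0.85, 0.80, 0.75, 0.70):
    pr, roc = [], []
    for train_idx, test_idx in cv.split(X_lab, y_lab):
        model, n_added = pseudo_label_fit(
            base, X_lab[train_idx], y_lab[train_idx], X_unlab, threshold)
        scores = model.predict_proba(X_lab[test_idx])[:, 1]
        pr.append(average_precision_score(y_lab[test_idx], scores))
        roc.append(roc_auc_score(y_lab[test_idx], scores))
    # Mean over the five folds for this threshold.
    results[threshold] = (float(np.mean(pr)), float(np.mean(roc)))
    print(threshold, results[threshold])
```

The held-out fold never enters training, so the pseudo-labels influence only the model, not the evaluation labels.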

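Feature Importance Comparison

This comparison uses the SVM predictions at 75% confidence (sheet 'X75' of Valenzuela_SVM_predict.xlsx) and the RF predictions at 95% confidence (sheet 'X95' of Valenzuela_RF_predict.xlsx).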

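Keep the structural feature group (G2) fixed and add each of the other feature groups to the model in turn, tuning the model parameters with gridsearch.py. The contribution of each group is assessed by the change in the mean AUC-PR and AUC-ROC under five-fold cross-validation. The code to run is the same as for supervised learning above.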
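The group-wise comparison described above can be sketched as follows. The mapping from feature groups to column indices is hypothetical (this page does not list which of the 14 features belong to which group), the data are synthetic, and average precision stands in for AUC-PR.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = make_classification(n_samples=200, n_features=14, n_informative=6,
                           weights=[0.85, 0.15], random_state=0)

# Hypothetical column indices per feature group.
groups = {
    "G1": [0, 1], "G2": [2, 3, 4], "G3": [5, 6],
    "G4": [7, 8], "G5": [9, 10, 11], "G6": [12, 13],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)


def mean_scores(columns):
    """Mean average precision and ROC AUC over the five folds for the given columns."""
    out = cross_validate(model, X[:, columns], y, cv=cv,
                         scoring=("average_precision", "roc_auc"))
    return out["test_average_precision"].mean(), out["test_roc_auc"].mean()


# Baseline: structural features (G2) only, then G2 plus one other group at a time.
ablation = {"G2": mean_scores(groups["G2"])}
for name, cols in groups.items():
    if name != "G2":
        ablation[f"G2+{name}"] = mean_scores(groups["G2"] + cols)

for name, (ap, roc) in ablation.items():
    print(f"{name:6s} AP={ap:.3f} AUC-ROC={roc:.3f}")
```

Comparing each "G2+Gx" row with the "G2" row gives the change attributable to group Gx.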

Experiment II

The following experiments are carried out on Dataset II.

Supervised Learning

The experiments use the labeled data Zhu_data.xlsx. The code is the same as for Supervised Learning in Experiment I.

Semi-Supervised Learning

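The dataset is split at ratios of 10%, 15%, 20%, 25% and 30% to create pseudo-unlabeled data, keeping the class proportions of the labeled data and the pseudo-unlabeled data the same. The snippet below shows the 10% setting.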

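> from sklearn.model_selection import StratifiedShuffleSplit
> # X, y: NumPy arrays with the features and labels loaded from Zhu_data.xlsx.
> # train_size=0.10 keeps 10% of the pairs as labeled data; the other 90% become pseudo-unlabeled.
> sss = StratifiedShuffleSplit(n_splits=1, test_size=0.90, train_size=0.10, random_state=7)
> for train_index, test_index in sss.split(X, y):
>     x_label, x_pseudo_unlabel = X[train_index], X[test_index]
>     y_label, y_pseudo_unlabel = y[train_index], y[test_index]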
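To cover all five ratios rather than the single 10% setting, the split can be wrapped in a loop. This is a sketch on synthetic data that follows the snippet's convention of treating train_size as the labeled share; the printout shows that the positive-class rate stays nearly identical on both sides of each split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the features and labels in Zhu_data.xlsx.
X, y = make_classification(n_samples=1000, n_features=14,
                           weights=[0.85, 0.15], random_state=0)

splits = {}
for ratio in (0.10, 0.15, 0.20, 0.25, 0.30):
    # test_size is left unset, so it defaults to the complement of train_size.
    sss = StratifiedShuffleSplit(n_splits=1, train_size=ratio, random_state=7)
    label_idx, pseudo_idx = next(sss.split(X, y))
    splits[ratio] = (label_idx, pseudo_idx)
    print(ratio, len(label_idx), len(pseudo_idx),
          round(y[label_idx].mean(), 3), round(y[pseudo_idx].mean(), 3))
```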

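According to Experiment I, SVM reaches its best semi-supervised performance at 75% confidence and RF at 95% confidence. This section therefore uses a 75% threshold for SVM and a 95% threshold for RF. The code to run is the same as for semi-supervised learning above.

Run Zhu_semi_results_comparison_figure.py to plot a comparison of semi-supervised learning performance across the split ratios.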

> python ./Zhu_semi_results_comparison_figure.py

Semi-Supervised Learning on Computer Science

Semi-supervised experiments are run only on the Computer Science subset of Zhu_data.xlsx.

> computer_science = Zhu_data[Zhu_data['discipline']=='CS']

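With RF as the base classifier, no sample meets the threshold, so only the SVM-75% experiment is run here. The code to run is the same as for semi-supervised learning above.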
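Whether any sample meets a threshold can be checked by counting the rows whose highest predicted class probability reaches it. The helper count_confident and the probabilities below are made up for illustration; they are not results from the paper.

```python
import numpy as np


def count_confident(proba, threshold):
    """Number of rows whose highest class probability is at least `threshold`."""
    return int((np.asarray(proba).max(axis=1) >= threshold).sum())


# Made-up class probabilities for five pseudo-unlabeled pairs
# (columns: not important, important).
proba = np.array([[0.93, 0.07],
                  [0.60, 0.40],
                  [0.22, 0.78],
                  [0.88, 0.12],
                  [0.51, 0.49]])

print(count_confident(proba, 0.95))  # 0: no row reaches 95%
print(count_confident(proba, 0.75))  # 3: rows 1, 3 and 4 reach 75%
```

A count of zero means that no pseudo-labels would be added, so the semi-supervised model would be identical to the supervised one.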

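Feature Importance Comparison

Using Zhu_data.xlsx, keep the structural feature group (G2) fixed and add each of the other feature groups to the model in turn, tuning the model parameters with gridsearch.py. The contribution of each group is assessed by the change in the mean AUC-PR and AUC-ROC under five-fold cross-validation. The code to run is the same as for supervised learning above.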

Discussion

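Comparison of Supervised Learning Results Across Four Disciplines

To verify that different disciplines may follow different citation patterns, supervised learning is run separately on the four disciplines with the most data: Computer Science, Genetics, Biophysics and Ecology.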

> computer_science = Zhu_data[Zhu_data['discipline']=='CS']
> genetics = Zhu_data[Zhu_data['discipline']=='Genetics']
> biophysics = Zhu_data[Zhu_data['discipline']=='Biophysics']
> ecology = Zhu_data[Zhu_data['discipline']=='Ecology']

The supervised learning code is the same as above.
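The four filters can also be written as one loop that picks the four largest disciplines and cross-validates a model on each. The sketch below builds a synthetic DataFrame in place of Zhu_data.xlsx; the feature columns f1 to f4 and the label column are hypothetical names, and only the 'discipline' column follows the filters shown above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for Zhu_data.xlsx.
rng = np.random.default_rng(0)
n = 600
feature_cols = ["f1", "f2", "f3", "f4"]
Zhu_data = pd.DataFrame(rng.normal(size=(n, 4)), columns=feature_cols)
Zhu_data["label"] = (Zhu_data["f1"] + rng.normal(scale=0.5, size=n) > 0.8).astype(int)
Zhu_data["discipline"] = rng.choice(
    ["CS", "Genetics", "Biophysics", "Ecology", "Other"],
    size=n, p=[0.4, 0.2, 0.15, 0.15, 0.1])

# The four disciplines with the most rows.
top4 = Zhu_data["discipline"].value_counts().head(4).index

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
per_discipline = {}
for name in top4:
    subset = Zhu_data[Zhu_data["discipline"] == name]
    X = subset[feature_cols].to_numpy()
    y = subset["label"].to_numpy()
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    # Mean ROC AUC over five folds within this discipline only.
    per_discipline[name] = float(
        cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean())
    print(name, len(subset), round(per_discipline[name], 3))
```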

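Case Study

The citing paper with id=Z002 in Dataset II is randomly selected for a case study, to verify that Dataset II and Dataset I follow different schemes for labeling important citations: some citations that the citing authors consider unimportant are in fact important for knowledge diffusion between the citing and cited papers.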
