This page shows the differences between the revision you selected and the current version.
zh:notes:important_citation [2021/02/08 20:06] pzczxs [ROC curves of the SVM, RF, and CNN models] |
zh:notes:important_citation [2022/01/27 08:23] (current version) pzczxs discussion status changed |
||
---|---|---|---|
Line 16: | Line 16: | ||
*xlwt | *xlwt | ||
*[[http://www.xpdfreader.com/|XPDF]] | *[[http://www.xpdfreader.com/|XPDF]] | ||
- | *R | + | *[[https://rstudio.com/|R]] |
*[[:zh:notes:install_parscit|ParsCit]] | *[[:zh:notes:install_parscit|ParsCit]] | ||
+ | *[[https://www.mysql.com/|MySQL]] | ||
===== Dataset ===== | ===== Dataset ===== | ||
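Line 49: | Line 50: | ||
The TXT-format data are parsed with ParsCit, and the ParsCit output is stored in <color red>./data/ParsCit/</color>. The title, authors, abstract, references, and other fields are extracted, and parsing errors are then corrected manually. | The TXT-format data are parsed with ParsCit, and the ParsCit output is stored in <color red>./data/ParsCit/</color>. The title, authors, abstract, references, and other fields are extracted, and parsing errors are then corrected manually. | ||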
<code bash> | <code bash> | ||
- | > cd ParsCit/bin | + | > cd ParsCit |
- | > ./citeExtract.pl -m extract_all ../demodata/sample2.txt sample2.txt.out | + | > ./run_parscit.sh valenzuela_txt |
+ | > ./run_parscit.sh zhu_txt | ||
</code> | </code> | ||
Line 72: | Line 74: | ||
===== Feature Engineering ===== | ===== Feature Engineering ===== | ||
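==== CIM Features ==== | ==== CIM Features ==== | ||
- | Feed the titles, abstracts, citation relations, and related information of the citing papers and their references into the CIM model to generate the .psi and .symKL files | + | The database is provided as two SQL files, <color red>acl_subset.sql</color> and <color red>zhu.sql</color>, which share the same table structure. |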
+ | |||
+ | Import the relevant information from the file <color red>data/pre_title.xls</color> into the MySQL database by running <color red>AclExcelImporter.java</color> in the package <color red>cn.edu.bjut.ui</color> of the project <color red>DataConverter</color>. | ||
+ | |||
+ | Import the relevant information from the file <color red>data/zhu_data_0924.xlsx</color> into the MySQL database by running <color red>ZhuExcelImporter.java</color> in the package <color red>cn.edu.bjut.ui</color> of the project <color red>DataConverter</color>. | ||
+ | |||
+ | Convert the data to the input format of the CIM model by running <color red>ToCIM.java</color> in the package <color red>cn.edu.bjut.ui</color> of the project <color red>DataConverter</color>. Note that the parameter is "data/CIM/acl" for Valenzuela's dataset and "data/CIM/zhu" for Zhu's dataset. | ||
+ | |||
+ | Run <color red>CIM.java</color> in the package <color red>cn.edu.bjut.ui</color> of the project <color red>CIM</color>; several files will be saved in the directories <color red>data/acl</color> and <color red>data/zhu</color>, respectively. Note that the parameter is "data/acl" for Valenzuela's dataset and "data/zhu" for Zhu's dataset. | ||
+ | |||
+ | Run <color red>FromCIM.java</color> in the package <color red>cn.edu.bjut.ui</color> of the project <color red>DataConverter</color>; two files, <color red>.symKL</color> and <color red>.psi</color>, will be saved in the directory <color red>data</color>. Note that the parameter is "data/CIM/acl" for Valenzuela's dataset and "data/CIM/zhu" for Zhu's dataset. |
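==== Structural Features ==== | ==== Structural Features ==== | ||
Line 95: | Line 107: | ||
==== Cue Word Features ==== | ==== Cue Word Features ==== | ||
Run <color red>cue_words.py</color>. The cue word list is in <color red>./data/cue_words.xls</color>. Regular-expression matching counts the important and unimportant cue words that appear in each citation, and the counts are exported to an Excel file. | Run <color red>cue_words.py</color>. The cue word list is in <color red>./data/cue_words.xls</color>. Regular-expression matching counts the important and unimportant cue words that appear in each citation, and the counts are exported to an Excel file. | ||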
- | <code shell> | + | <code bash> |
> python ./cue_words.py | > python ./cue_words.py | ||
</code> | </code> | ||
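The cue word counting described above might look roughly like the following. This is a minimal sketch, not the contents of cue_words.py: the two word lists are made-up placeholders (the real lists are in ./data/cue_words.xls), the helper names are hypothetical, and the Excel export step is left out.

```python
import re

# Hypothetical cue words for illustration only; the real lists live in ./data/cue_words.xls.
IMPORTANT_CUES = ["extend", "based on", "following"]
UNIMPORTANT_CUES = ["related work", "for example", "such as"]


def compile_cues(cues):
    """Build one case-insensitive pattern that matches any cue as a whole word or phrase."""
    # Longer cues first so a phrase is preferred over a shorter cue it contains.
    ordered = sorted(cues, key=len, reverse=True)
    alternatives = "|".join(re.escape(cue) for cue in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def count_cues(citation_context, pattern):
    """Count non-overlapping cue matches in one citation context."""
    return len(pattern.findall(citation_context))


important_pattern = compile_cues(IMPORTANT_CUES)
unimportant_pattern = compile_cues(UNIMPORTANT_CUES)

context = "Following Smith (2010), we extend their parser; see related work such as [3]."
print(count_cues(context, important_pattern), count_cues(context, unimportant_pattern))
```

The word boundaries mean that inflected forms such as "extended" are not counted; whether the original script stems or lemmatizes the text first is not stated in these notes.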
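Line 130: | Line 142: | ||
A paired-samples t-test is applied to check whether the results of different feature groups differ significantly. | A paired-samples t-test is applied to check whether the results of different feature groups differ significantly. | ||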
- | <code shell> | + | <code bash> |
> from scipy import stats | > from scipy import stats | ||
> stats.ttest_rel(G1,G2) | > stats.ttest_rel(G1,G2) | ||
</code> | </code> | ||
+ | |||
+ | ~~DISCUSSION:closed~~ |
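To make explicit what the paired-samples t-test above computes, the sketch below derives the t statistic by hand from the per-pair differences, using only the standard library. The scores in G1 and G2 are made-up numbers, and paired_t_statistic is a hypothetical helper. scipy's stats.ttest_rel reports the same statistic along with a p-value, which this sketch does not compute.

```python
import math
import statistics


def paired_t_statistic(g1, g2):
    """Return (t, degrees of freedom) for a paired-samples t-test."""
    if len(g1) != len(g2) or len(g1) < 2:
        raise ValueError("paired samples need equal length and at least two pairs")
    diffs = [a - b for a, b in zip(g1, g2)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Made-up scores of two feature groups on the same five folds (illustration only).
G1 = [0.80, 0.82, 0.78, 0.85, 0.81]
G2 = [0.75, 0.79, 0.77, 0.80, 0.78]
t, df = paired_t_statistic(G1, G2)
print(f"t = {t:.3f}, df = {df}")
```

The test is paired because both feature groups are evaluated on the same items, so only the per-item differences enter the statistic.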