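This page shows the differences between the revision you selected and the current version.
Both sides previous revision Previous revision Next revision | Previous revision | ||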
zh:notes:common_specialty [2020/08/28 10:57] pzczxs [Syntactic and Lexical Complexity before Filtering Stopwords] |
zh:notes:common_specialty [2022/11/08 07:33] (current) pzczxs [Citation Information] |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== An Approach for Detecting the Commonality and Specialty between Scientific Publications and Patents ====== | + | ====== Commonality and Specialty Detection ====== |
+ | |||
+ | ===== Citation Information ===== | ||
+ | Shuo Xu, Ling Li, Xin An, Liyuan Hao, and Guancan Yang, 2021. [[https://doi.org/10.1007/s11192-021-04085-9|An Approach for Detecting the Commonality and Specialty between Scientific Publications and Patents]]. //Scientometrics//, Vol. 126, No. 9, pp. 7445-7475. | ||
===== Requirements ===== | ===== Requirements ===== | ||
Line 57: | Line 60: | ||
To import the related information into the database with <color red>PatentURLImporter.java</color> in the package <color red>cn.edu.bjut.ui</color> from the directory <color red>data/patents/url</color>. | To import the related information into the database with <color red>PatentURLImporter.java</color> in the package <color red>cn.edu.bjut.ui</color> from the directory <color red>data/patents/url</color>. | ||
- | To update the country information with <color red>PatentCountryUpdater.java</color> in the package <color red>cn.edu.bjut.ui</color>. | + | To update the country information with <color red>PatentOriginalCountryUpdater.java</color> in the package <color red>cn.edu.bjut.ui</color>. |
+ | <!-- | ||
To manually remove the irrelevant information from the abstract of the patent with id = 5875. | To manually remove the irrelevant information from the abstract of the patent with id = 5875. | ||
<code sql> | <code sql> | ||
SELECT * FROM patent WHERE id = 5875; | SELECT * FROM patent WHERE id = 5875; | ||
</code> | </code> | ||
+ | --> | ||
===== Export All Documents in the Format of genia ===== | ===== Export All Documents in the Format of genia ===== | ||
Line 81: | Line 86: | ||
To run <color red>HMMLDA.java</color> in the package <color red>cn.edu.bjut.ui</color> from another project <color red>HMM-LDA</color>. If desired, the parameters can be set through a configuration file <color red>HMMLDA.properties</color>, located in the directory <color red>conf</color>. | To run <color red>HMMLDA.java</color> in the package <color red>cn.edu.bjut.ui</color> from another project <color red>HMM-LDA</color>. If desired, the parameters can be set through a configuration file <color red>HMMLDA.properties</color>, located in the directory <color red>conf</color>. | ||
===== Syntactic and Lexical Complexity before Filtering Stopwords ===== | ===== Syntactic and Lexical Complexity before Filtering Stopwords ===== | ||
- | To run <color red>synatic complexity_Patent.ipynb</color> and <color red>synatic complexity_Article.ipynb</color> in the directory <color red>indicators/before</color> to calculate syntactic complexity indicators (Title/Abstract/Abstract average sentence Length); | + | To run <color red>synatic complexity_Patent.py</color> and <color red>synatic complexity_Article.py</color> in the directory <color red>indicators/before</color> to calculate syntactic complexity indicators (Title/Abstract/Abstract average sentence Length); |
- | To run <color red>Abs_Sen_Complexity_Patent.ipynb</color> and <color red>Abs_Sen_Complexity_Article.ipynb</color> in the directory <color red>indicators/before</color> to save the parsed tree structures, and then use the tool <color red>stanford-tregex</color> to calculate sentence complexity; | + | To run <color red>Abs_Sen_Complexity_Patent.py</color> and <color red>Abs_Sen_Complexity_Article.py</color> in the directory <color red>indicators/before</color> to save the parsed tree structures, and then use the tool <color red>stanford-tregex</color> to calculate sentence complexity; |
- | To run <color red>lexical complexity_Patent_Title.ipynb<color>, <color red>lexical complexity_Patent_Abs.ipynb</color> and <color red>lexical complexity_Article_Title.ipynb<color>, <color red>lexical complexity_Article_Abs.ipynb</color> in the directory to calculate lexical complexity indicators (Lexical Diversity/ Sophistication /Density); | + | To run <color red>lexical complexity_Patent_Title.py</color>, <color red>lexical complexity_Patent_Abs.py</color> and <color red>lexical complexity_Article_Title.py</color>, <color red>lexical complexity_Article_Abs.py</color> in the directory <color red>indicators/before</color> to calculate lexical complexity indicators (Lexical Diversity/ Sophistication /Density); |
===== Syntactic and Lexical Complexity after Filtering Stopwords ===== | ===== Syntactic and Lexical Complexity after Filtering Stopwords ===== | ||
+ | To run <color red>Mean_synatic_complexity_Patent.py</color> and <color red>Mean_synatic_complexity_Article.py</color> in the directory <color red>indicators/after</color> to calculate the meaningful syntactic complexity indicators (Title/Abstract/Abstract average sentence Length); | ||
+ | |||
+ | To run <color red>Mean_lexical_complexity_Patent_Title.py</color>, <color red>Mean_lexical complexity_Patent_Abs.py</color>, <color red>Mean_lexical complexity_Article_Title.py</color>, and <color red>Mean_lexical_complexity_Article_Abs.py</color> in the directory <color red>indicators/after</color> to calculate the meaningful lexical complexity indicators (Lexical Diversity/ Sophistication /Density); | ||
+ | |||
+ | ===== Descriptive Statistics and Word Cloud ===== | ||
+ | To run <color red>Statistics.py</color> and <color red>overlap.py</color> in the directory <color red>indicators/before</color> to count the number of (overlapped) tokens and (overlapped) unique words; | ||
+ | |||
+ | To run <color red>Mean_Statistics.py</color>, <color red>Mean_overlap.py</color> and <color red>Non_overlap.py</color> in the directory <color red>indicators/after</color> to count the number of (overlapped) tokens and (overlapped) unique words, and save overlapped words with their corresponding word frequencies. | ||
+ | |||
===== Format Data for the CDTM Model ===== | ===== Format Data for the CDTM Model ===== | ||
- | To run <color red>Trans_CDTM.ipynb</color> in the directory <color red>CDTM-Test</color>. At this step, a dictionary will be generated as the files <color red>.word.vocab</color> and <color red>ID.csv</color>. Then, with the help of Excel, generate two documents with the extensions <color red>.docs</color> and <color red>.corpus</color>. | + | To run <color red>Trans_CDTM.py</color> in the directory <color red>CDTM-Test</color>. At this step, a dictionary will be generated as the files <color red>.word.vocab</color> and <color red>ID.csv</color>. Then, with the help of Excel, generate two documents with the extensions <color red>.docs</color> and <color red>.corpus</color>. |
===== Estimate a CDTM Model ===== | ===== Estimate a CDTM Model ===== | ||
To run <color red>CdtmParameterTuning.java</color> in the package <color red>cn.edu.bjut.ui</color>. The perplexity will be obtained for each candidate value combination of the number of common topics, the number of topics specific to scientific publications, and the number of topics specific to patents. | To run <color red>CdtmParameterTuning.java</color> in the package <color red>cn.edu.bjut.ui</color>. The perplexity will be obtained for each candidate value combination of the number of common topics, the number of topics specific to scientific publications, and the number of topics specific to patents. | ||
Line 95: | Line 109: | ||
Then, import the perplexity values into MATLAB and run <color red>TuneParam.m</color>. A figure will show the perplexity for different numbers of topics. By observing this figure, the optimal numbers of common and special topics can be determined. Finally, run <color red>Cdtm.java</color> in the package <color red>cn.edu.bjut.ui</color> to obtain the final results. | Then, import the perplexity values into MATLAB and run <color red>TuneParam.m</color>. A figure will show the perplexity for different numbers of topics. By observing this figure, the optimal numbers of common and special topics can be determined. Finally, run <color red>Cdtm.java</color> in the package <color red>cn.edu.bjut.ui</color> to obtain the final results. | ||
===== Connections amongst Common and Special Topics ===== | ===== Connections amongst Common and Special Topics ===== | ||
- | To run <color red>NetworkConverter.java</color> in the package <color red>cn.edu.bjut.ui</color>. One map file and one network file will be generated. These two files can then be imported into the software //VOSviewer//. | + | To run <color red>NetworkConverter.java</color> in the package <color red>cn.edu.bjut.ui</color>. One map file and one network file will be generated. These two files can then be imported into the software //VOSviewer//. |
+ | |||
+ | ~~DISCUSSION:closed~~ |
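
The counting described under "Descriptive Statistics and Word Cloud" can be sketched roughly as follows. This is a minimal illustration, not the content of <color red>Statistics.py</color> or <color red>overlap.py</color>: the function name `overlap_statistics`, the returned keys, and the toy input are hypothetical, and it assumes each corpus has already been tokenized into a flat list of lowercase words. "Overlapped tokens" is interpreted here as the token occurrences whose word appears in both corpora.

```python
from collections import Counter


def overlap_statistics(article_tokens, patent_tokens):
    """Count tokens, unique words, and their overlap between two corpora.

    Hypothetical helper: both arguments are flat lists of already
    tokenized, lowercased words (an assumption of this sketch).
    """
    article_counts = Counter(article_tokens)
    patent_counts = Counter(patent_tokens)

    # Unique words that occur in both corpora.
    shared = set(article_counts) & set(patent_counts)

    return {
        "article_tokens": sum(article_counts.values()),
        "patent_tokens": sum(patent_counts.values()),
        "article_unique": len(article_counts),
        "patent_unique": len(patent_counts),
        "overlapped_unique": len(shared),
        # Token occurrences, per corpus, of the shared words.
        "overlapped_article_tokens": sum(article_counts[w] for w in shared),
        "overlapped_patent_tokens": sum(patent_counts[w] for w in shared),
        # Shared words with (article frequency, patent frequency).
        "overlapped_frequencies": {
            w: (article_counts[w], patent_counts[w]) for w in sorted(shared)
        },
    }


if __name__ == "__main__":
    # Toy data for illustration only.
    articles = "topic model for patent analysis".split()
    patents = "patent claim and patent topic".split()
    for key, value in overlap_statistics(articles, patents).items():
        print(key, value)
```

The per-word frequency pairs in `overlapped_frequencies` are the kind of table one could feed to a word-cloud tool; how the original scripts store them is not stated in this note.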