The database SQL file: math_software.sql.
Project: WoSImporter
One can preprocess all files in BibTeX format by running BibTeXPreprocessor.java in the package cn.edu.bjut.ui.
> nohup preprocess-wos.sh ../dataset/WoS/papers > preprocess-wos.log 2>&1
The articles can be imported into the database by running ArticleBibTexImporter.java in the package cn.edu.bjut.ui. Note that the parameters checkFlag and citedArticleFlag should both be set to false. During this step, the DOI names of cited articles are pre-processed with the cleaning method of Xu et al. (2019).
> nohup ./import-wos.sh 0 WoS > import-wos.log 2>&1
> nohup ./import-wos-keyword WoS > import-wos-keyword.log 2>&1
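To give a rough idea of what cleaning the DOI names of cited articles can involve, here is a minimal Python sketch. It is a generic normalization and not the procedure of Xu et al. (2019); the function name normalize_doi and the rules it applies (dropping a resolver prefix, trimming trailing punctuation, lower-casing) are assumptions made for illustration only.

```python
import re

# Hypothetical helper: generic DOI normalization, NOT the method of Xu et al. (2019).
_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)
_DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw):
    """Return a lower-cased DOI without resolver prefix, or None if it does not look like a DOI."""
    doi = _PREFIX.sub("", raw.strip())  # drop "https://doi.org/" or "doi:" prefixes
    doi = doi.rstrip(".,;")             # drop punctuation left over from the reference string
    doi = doi.lower()                   # DOI names are case-insensitive
    return doi if _DOI_SHAPE.match(doi) else None
```

Strings that do not match the usual "10.prefix/suffix" shape are returned as None here, so that they can be inspected separately rather than silently stored.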
Wikipedia dumps (version: 2026-03-01) should be downloaded in advance.
The plain text can be extracted from the downloaded English Wikipedia dumps by running the following commands. Please refer to gensim.scripts.segment_wiki for more details.
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p10p1147431.xml.bz2 -o enwiki-2026-03-01-p10p1147431.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p1147434p3987701.xml.bz2 -o enwiki-2026-03-01-p1147434p3987701.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p3987703p8213792.xml.bz2 -o enwiki-2026-03-01-p3987703p8213792.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p8213793p13295371.xml.bz2 -o enwiki-2026-03-01-p8213793p13295371.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p13295373p18816201.xml.bz2 -o enwiki-2026-03-01-p13295373p18816201.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p18816202p24038461.xml.bz2 -o enwiki-2026-03-01-p18816202p24038461.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p24038462p29075629.xml.bz2 -o enwiki-2026-03-01-p24038462p29075629.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p29075630p34204620.xml.bz2 -o enwiki-2026-03-01-p29075630p34204620.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p34204621p39293698.xml.bz2 -o enwiki-2026-03-01-p34204621p39293698.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p39293699p43920660.xml.bz2 -o enwiki-2026-03-01-p39293699p43920660.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p43920661p48725620.xml.bz2 -o enwiki-2026-03-01-p43920661p48725620.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p48725621p53857278.xml.bz2 -o enwiki-2026-03-01-p48725621p53857278.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p53857280p58693957.xml.bz2 -o enwiki-2026-03-01-p53857280p58693957.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p58693958p63265982.xml.bz2 -o enwiki-2026-03-01-p58693958p63265982.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p63265983p67638983.xml.bz2 -o enwiki-2026-03-01-p63265983p67638983.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p67638984p71810319.xml.bz2 -o enwiki-2026-03-01-p67638984p71810319.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p71810320p76318043.xml.bz2 -o enwiki-2026-03-01-p71810320p76318043.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p76318044p80915674.xml.bz2 -o enwiki-2026-03-01-p76318044p80915674.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p80915675p82539885.xml.bz2 -o enwiki-2026-03-01-p80915675p82539885.json.gz
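Each .json.gz file written by segment_wiki holds one JSON object per line, with the keys title, section_titles and section_texts (plus interlinks when -i is given). The following Python sketch shows one way to read such a file back as plain text; the helper name iter_articles is an assumption and not part of gensim.

```python
import gzip
import json


def iter_articles(path):
    """Yield (title, plain_text) for each article in a segment_wiki output file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            article = json.loads(line)
            # section_texts is a list with one string per section of the article.
            text = "\n".join(article["section_texts"])
            yield article["title"], text
```

Because the function is a generator, a dump part can be streamed article by article without loading the whole file into memory.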
Then, one- and two-gram statistics can be obtained by running the following command (cf. Project GramCounter).
> nohup ./run-gram-counter.sh > run-gram-counter.log 2>&1
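For orientation, one- and two-gram counting can be sketched in a few lines of Python. This is not the implementation of Project GramCounter; the function name count_grams and the tokenization rule (lower-cased runs of letters and digits) are assumptions chosen to keep the example short.

```python
import re
from collections import Counter

# Assumed tokenization: lower-cased runs of ASCII letters and digits.
TOKEN = re.compile(r"[a-z0-9]+")


def count_grams(texts):
    """Return (unigram_counts, bigram_counts) over an iterable of plain-text strings."""
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        tokens = TOKEN.findall(text.lower())
        unigrams.update(tokens)
        # Pair each token with its successor; bigrams never span two texts.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams
```

Bigrams are stored as tuples of two tokens, so a lookup takes the form bigrams[("math", "software")].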
Project: MathSoftware
The citing articles can be exported for the Gaussian ATcredit model by running ToGaussianATCreditConverter.java in the package cn.edu.bjut.converter.
Project: GaussianATModelWithCredit
> java -jar .\GaussianATModelWithCredit.jar -D 50 -n 2000 -s 7 -fb math_software/math_software -K 200
The following three diversity indicators are calculated: (1) Rao-Stirling, (2) DIV, and (3) Diversity.
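As an illustration of the first indicator, the Rao-Stirling diversity is commonly defined as the sum of p_i * p_j * d_ij over all pairs of distinct categories, where p_i is the share of category i and d_ij is the distance between categories i and j. The Python sketch below follows that definition under stated assumptions: the function name rao_stirling, the input layout, and the convention of summing over ordered pairs are illustrative, and how the project itself derives the proportions and distances is not specified here.

```python
def rao_stirling(proportions, distance):
    """Rao-Stirling diversity: sum of p_i * p_j * d_ij over ordered pairs with i != j.

    proportions: list of category shares that sum to 1.
    distance: square matrix (list of lists) of pairwise distances in [0, 1].
    """
    total = 0.0
    n = len(proportions)
    for i in range(n):
        for j in range(n):
            if i != j:
                total += proportions[i] * proportions[j] * distance[i][j]
    return total
```

Some authors sum over unordered pairs (i < j) instead, which gives exactly half of this value when the distance matrix is symmetric, so the convention should be stated whenever values are compared.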