The database SQL file: math_software.sql.
Project: WoSImporter
One can preprocess all files in BibTeX format by running BibTeXPreprocessor.java in the package cn.edu.bjut.ui.
> nohup preprocess-wos.sh ../dataset/WoS/papers > preprocess-wos.log 2>&1
The articles can be imported into the database by running ArticleBibTexImporter.java in the package cn.edu.bjut.ui. Note that the parameters checkFlag and citedArticleFlag should be set to false. In this procedure, the DOI names of cited articles are pre-processed with the cleaning method in Xu et al. (2019).
> nohup ./import-wos.sh 0 WoS > import-wos.log 2>&1
> nohup ./import-wos-keyword WoS > import-wos-keyword.log 2>&1
Project: MathSoftware
The citing articles can be exported for the Gaussian ATcredit model by running ToGaussianATCreditConverter.java in the package cn.edu.bjut.converter.
Wikipedia dumps (version: 2026-03-01) should be downloaded in advance.
Plain text can be extracted from the downloaded English Wikipedia dumps by running the following commands. Please refer to gensim.scripts.segment_wiki for more details.
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p10p1147431.xml.bz2 -o enwiki-2026-03-01-p10p1147431.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p1147434p3987701.xml.bz2 -o enwiki-2026-03-01-p1147434p3987701.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p3987703p8213792.xml.bz2 -o enwiki-2026-03-01-p3987703p8213792.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p8213793p13295371.xml.bz2 -o enwiki-2026-03-01-p8213793p13295371.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p13295373p18816201.xml.bz2 -o enwiki-2026-03-01-p13295373p18816201.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p18816202p24038461.xml.bz2 -o enwiki-2026-03-01-p18816202p24038461.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p24038462p29075629.xml.bz2 -o enwiki-2026-03-01-p24038462p29075629.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p29075630p34204620.xml.bz2 -o enwiki-2026-03-01-p29075630p34204620.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p34204621p39293698.xml.bz2 -o enwiki-2026-03-01-p34204621p39293698.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p39293699p43920660.xml.bz2 -o enwiki-2026-03-01-p39293699p43920660.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p43920661p48725620.xml.bz2 -o enwiki-2026-03-01-p43920661p48725620.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p48725621p53857278.xml.bz2 -o enwiki-2026-03-01-p48725621p53857278.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p53857280p58693957.xml.bz2 -o enwiki-2026-03-01-p53857280p58693957.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p58693958p63265982.xml.bz2 -o enwiki-2026-03-01-p58693958p63265982.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p63265983p67638983.xml.bz2 -o enwiki-2026-03-01-p63265983p67638983.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p67638984p71810319.xml.bz2 -o enwiki-2026-03-01-p67638984p71810319.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p71810320p76318043.xml.bz2 -o enwiki-2026-03-01-p71810320p76318043.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p76318044p80915674.xml.bz2 -o enwiki-2026-03-01-p76318044p80915674.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p80915675p82539885.xml.bz2 -o enwiki-2026-03-01-p80915675p82539885.json.gz
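Since every extraction command has the same shape, the per-file invocations could also be generated in a loop. The following is only a sketch, not part of the projects above: the helper names build_command and segment_all are hypothetical, and it assumes that all dump files sit in the current directory and end in .xml.bz2 (Python 3.9+).

```python
import subprocess
import sys
from pathlib import Path


def build_command(dump: Path) -> list[str]:
    """Build the segment_wiki command line for one dump file (hypothetical helper)."""
    # enwiki-...xml.bz2 -> enwiki-...json.gz, same naming as in the commands above
    output = dump.with_name(dump.name.removesuffix(".xml.bz2") + ".json.gz")
    return [
        sys.executable, "-m", "gensim.scripts.segment_wiki",
        "-i", "-f", str(dump), "-o", str(output),
    ]


def segment_all(dump_dir: Path) -> None:
    """Run segment_wiki once per dump file found in dump_dir (assumed layout)."""
    for dump in sorted(dump_dir.glob("enwiki-*.xml.bz2")):
        # check=True stops at the first failing extraction
        subprocess.run(build_command(dump), check=True)


if __name__ == "__main__":
    segment_all(Path("."))
```

The dumps are processed one after another; each call blocks until the corresponding .json.gz file is written.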
Then, unigram and bigram statistics can be obtained by running the following commands (cf. Project WikipediaTool).
> java -jar .\WikipediaTool.jar -t 0 -i ..\Wikipedia\ -o ..\Wikipedia-token\ > tokenization.log
> java -jar .\WikipediaTool.jar -t 1 -i ..\Wikipedia-token\ -o ..\Wikipedia-token-cleaned\ > cleaning.log
> java -jar .\WikipediaTool.jar -t 2 -d ..\data\math_software.word.vocab -i ..\Wikipedia-token-cleaned\ -o ..\data\Wikipedia\ > extraction.log
> java -Xmx25g -jar .\WikipediaTool.jar -t 3 -d ..\data\math_software.word.vocab -l XX -u XX -i ..\Wikipedia-token-cleaned\ -o ..\Wikipedia-statistics\ > counting.log
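To illustrate what unigram and bigram statistics restricted to a vocabulary look like, here is a minimal Python sketch. It is not the WikipediaTool implementation: the function name count_ngrams is hypothetical, and it assumes that a bigram is a pair of adjacent tokens that both belong to the vocabulary. WikipediaTool may define its counts differently (for example, over a co-occurrence window).

```python
from collections import Counter
from typing import Iterable


def count_ngrams(sentences: Iterable[list[str]], vocab: set[str]):
    """Count in-vocabulary unigrams and adjacent in-vocabulary bigrams."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for tokens in sentences:
        # Unigrams: every occurrence of a vocabulary word
        for token in tokens:
            if token in vocab:
                unigrams[token] += 1
        # Bigrams: adjacent pairs where both words are in the vocabulary (assumption)
        for left, right in zip(tokens, tokens[1:]):
            if left in vocab and right in vocab:
                bigrams[(left, right)] += 1
    return unigrams, bigrams
```

For example, count_ngrams([["sparse", "matrix", "solver"]], {"sparse", "matrix"}) counts each of the two vocabulary words once and the pair ("sparse", "matrix") once.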
Project: GaussianATModelWithCredit
> java -jar .\GaussianATModelWithCredit.jar -D 50 -n 2000 -s 7 -fb math_software/math_software -K 200
The following three diversity indicators are calculated: (1) Rao-Stirling (Rao, 1982; Stirling, 2007), (2) DIV (Leydesdorff et al., 2019), and (3) Diversity (Mutz, 2022).
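As an illustration of the first indicator, the Rao-Stirling diversity of a distribution over categories is commonly written as the sum of p_i * p_j * d_ij over all pairs i != j, where p_i is the share of category i and d_ij is the distance between categories i and j. The sketch below follows that form; the function name rao_stirling is hypothetical, and it sums over ordered pairs, whereas some papers sum over i < j, which halves the value. Which convention the MathSoftware project uses is not stated here.

```python
from typing import Sequence


def rao_stirling(proportions: Sequence[float],
                 distances: Sequence[Sequence[float]]) -> float:
    """Rao-Stirling diversity: sum of p_i * p_j * d_ij over ordered pairs i != j."""
    n = len(proportions)
    return sum(
        proportions[i] * proportions[j] * distances[i][j]
        for i in range(n)
        for j in range(n)
        if i != j
    )
```

With a single category the sum is empty and the diversity is 0; the value grows as shares become more balanced and categories more distant.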