Math Software Dataset Construction

Citation Information

Data Sources

zbMATH
Web of Science
English Wikipedia
Word Embedding: GloVe

Create Database

The database SQL file: math_software.sql.

zbMATH

Web of Science

Project: WoSImporter

Preprocess Articles

One can preprocess all files in BibTeX format by running BibTeXPreprocessor.java in the package cn.edu.bjut.ui.

> nohup preprocess-wos.sh ../dataset/WoS/papers > preprocess-wos.log 2>&1

Import Articles

The articles can be imported into the database by running ArticleBibTexImporter.java in the package cn.edu.bjut.ui. Note that the parameters checkFlag and citedArticleFlag should both be set to false. During this step, the DOI names of cited articles are pre-processed with the cleaning method of Xu et al. (2019).
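
To illustrate what pre-processing a DOI name can involve, here is a minimal Python sketch of a generic normalization. It is not the cleaning method of Xu et al. (2019), whose rules are not reproduced on this page; the function name normalize_doi and the rules it applies (resolver prefix removal, trailing punctuation removal, lower-casing) are assumptions made for illustration only.

```python
import re

# Hypothetical helper: a generic DOI normalization for illustration.
# It is NOT the cleaning method of Xu et al. (2019).
_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)
_DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw):
    """Return a lower-cased DOI without resolver prefix, or None if it does not look like a DOI."""
    doi = _PREFIX.sub("", raw.strip())  # drop "https://doi.org/" or "doi:" prefixes
    doi = doi.rstrip(".,;")             # drop punctuation left over from reference strings
    doi = doi.lower()                   # DOI names are case-insensitive
    return doi if _DOI_SHAPE.match(doi) else None
```

For example, normalize_doi("https://doi.org/10.1000/ABC.123.") yields "10.1000/abc.123", while a string without the "10.prefix/suffix" shape yields None.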

> nohup ./import-wos.sh 0 WoS > import-wos.log 2>&1
> nohup ./import-wos-keyword WoS > import-wos-keyword.log 2>&1

Converters

Gram Counting for PMI Calculation

The English Wikipedia dumps (version: 2026-03-01) should be downloaded in advance.

The plain text can then be extracted from the downloaded dumps by running the following commands. Please refer to gensim.scripts.segment_wiki for more details.

> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p10p1147431.xml.bz2 -o enwiki-2026-03-01-p10p1147431.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p1147434p3987701.xml.bz2 -o enwiki-2026-03-01-p1147434p3987701.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p3987703p8213792.xml.bz2 -o enwiki-2026-03-01-p3987703p8213792.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p8213793p13295371.xml.bz2 -o enwiki-2026-03-01-p8213793p13295371.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p13295373p18816201.xml.bz2 -o enwiki-2026-03-01-p13295373p18816201.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p18816202p24038461.xml.bz2 -o enwiki-2026-03-01-p18816202p24038461.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p24038462p29075629.xml.bz2 -o enwiki-2026-03-01-p24038462p29075629.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p29075630p34204620.xml.bz2 -o enwiki-2026-03-01-p29075630p34204620.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p34204621p39293698.xml.bz2 -o enwiki-2026-03-01-p34204621p39293698.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p39293699p43920660.xml.bz2 -o enwiki-2026-03-01-p39293699p43920660.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p43920661p48725620.xml.bz2 -o enwiki-2026-03-01-p43920661p48725620.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p48725621p53857278.xml.bz2 -o enwiki-2026-03-01-p48725621p53857278.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p53857280p58693957.xml.bz2 -o enwiki-2026-03-01-p53857280p58693957.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p58693958p63265982.xml.bz2 -o enwiki-2026-03-01-p58693958p63265982.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p63265983p67638983.xml.bz2 -o enwiki-2026-03-01-p63265983p67638983.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p67638984p71810319.xml.bz2 -o enwiki-2026-03-01-p67638984p71810319.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p71810320p76318043.xml.bz2 -o enwiki-2026-03-01-p71810320p76318043.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p76318044p80915674.xml.bz2 -o enwiki-2026-03-01-p76318044p80915674.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p80915675p82539885.xml.bz2 -o enwiki-2026-03-01-p80915675p82539885.json.gz
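
Because the commands above differ only in the file name, they can also be generated in a loop. The following Python sketch assumes that all dump files sit in one directory and end in ".xml.bz2"; the function names build_command and segment_all are hypothetical. The -i, -f and -o options are the same ones used in the commands above.

```python
import subprocess
import sys
from pathlib import Path

SUFFIX = ".xml.bz2"


def build_command(dump_path):
    """Build the segment_wiki call for one dump file, mirroring the commands above."""
    dump_path = Path(dump_path)
    if not dump_path.name.endswith(SUFFIX):
        raise ValueError(f"unexpected dump file name: {dump_path.name}")
    # Swap the ".xml.bz2" suffix for ".json.gz" to get the output file name.
    out_path = dump_path.with_name(dump_path.name[: -len(SUFFIX)] + ".json.gz")
    return [
        sys.executable, "-m", "gensim.scripts.segment_wiki",
        "-i", "-f", str(dump_path), "-o", str(out_path),
    ]


def segment_all(dump_dir):
    """Run segment_wiki on every enwiki dump in dump_dir, one file after another."""
    for dump in sorted(Path(dump_dir).glob("enwiki-*" + SUFFIX)):
        subprocess.run(build_command(dump), check=True)
```

Calling segment_all("path/to/dumps") processes the files sequentially and stops at the first failing call, since check=True raises an exception on a non-zero exit code.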

Then, the one-gram and two-gram statistics can be obtained by running the following command (cf. Project GramCounter).

> nohup ./run-gram-counter.sh > run-gram-counter.log 2>&1
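
The GramCounter project itself is not shown on this page. As a rough idea of what such counting and the later PMI calculation involve, here is a minimal Python sketch. The tokenizer, the function names and the PMI estimate log2(p(x,y) / (p(x) p(y))) are assumptions and may differ from what GramCounter actually does. The reader function relies on the JSON-lines output of segment_wiki, in which each line holds one article with a "section_texts" list.

```python
import gzip
import json
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9]+")  # assumed tokenizer: lower-cased alphanumeric runs


def iter_section_texts(json_gz_path):
    """Yield the plain text of every section in a segment_wiki output file."""
    with gzip.open(json_gz_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield from json.loads(line)["section_texts"]


def count_grams(texts):
    """Count one-grams and adjacent two-grams over an iterable of texts."""
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        tokens = TOKEN.findall(text.lower())
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def pmi(w1, w2, unigrams, bigrams):
    """Pointwise mutual information (base 2) of the two-gram (w1, w2)."""
    joint = bigrams[(w1, w2)]
    if joint == 0:
        return float("-inf")  # the pair was never observed
    p_xy = joint / sum(bigrams.values())
    p_x = unigrams[w1] / sum(unigrams.values())
    p_y = unigrams[w2] / sum(unigrams.values())
    return math.log2(p_xy / (p_x * p_y))
```

A typical call would be count_grams(iter_section_texts("some-output.json.gz")), where the file name is a placeholder for one of the files produced in the previous step.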

Export for Gaussian AT credit Model

Project: MathSoftware

The citing articles can be exported for the Gaussian AT^credit model by running ToGaussianATCreditConverter.java in the package cn.edu.bjut.converter.

Extract Themes

Project: GaussianATModelWithCredit

> java -jar .\GaussianATModelWithCredit.jar -D 50 -n 2000 -s 7 -fb math_software/math_software -K 200

Calculate Diversity Indicators

The following three diversity indicators are calculated: (1) Rao-Stirling, (2) DIV, and (3) Diversity.
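
As an illustration of the first indicator, the Rao-Stirling diversity of a set of categories with proportions p_i and pairwise distances d_ij is commonly written as the sum of p_i * p_j * d_ij over all pairs with i != j. The Python sketch below follows that form over ordered pairs; some authors sum over i < j instead, which halves the value for a symmetric distance matrix. The function name and the input layout are assumptions, and the page does not state which variant or distance measure this project uses.

```python
def rao_stirling(proportions, distance):
    """Rao-Stirling diversity: sum of p_i * p_j * d_ij over ordered pairs i != j.

    proportions: list of category shares that sum to 1.
    distance:    square matrix (list of lists) of pairwise distances.
    """
    n = len(proportions)
    if any(len(row) != n for row in distance) or len(distance) != n:
        raise ValueError("distance must be an n x n matrix matching proportions")
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += proportions[i] * proportions[j] * distance[i][j]
    return total
```

With two equally weighted categories at distance 1, the function returns 0.5; when all weight falls on a single category, it returns 0.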
