This page shows the differences between the selected revision and the current version.
| Next revision | Previous revision | ||
|
zh:notes:math_softwares [2026/03/28 19:05] pzczxs created |
zh:notes:math_softwares [2026/03/28 21:47] (current) pzczxs [Create Database] |
||
|---|---|---|---|
| Line 1: | Line 1: | ||
| ====== Math Software Dataset Construction ====== | ====== Math Software Dataset Construction ====== | ||
| + | ===== Citation Information ===== | ||
| + | - | ||
| + | |||
| + | ===== Create Database ===== | ||
| + | The database SQL file: <color red>math_software.sql</color>. | ||
| + | ===== Download Data ===== | ||
| + | [[https://zbmath.org/software/|zbmath]] | ||
| + | |||
| + | ===== Word Embedding ===== | ||
| + | * [[https://nlp.stanford.edu/projects/glove/|GloVe]] | ||
| + | * | ||
| + | |||
| + | ===== PMI Calculation ===== | ||
| + | [[https://dumps.wikimedia.org/backup-index.html|Wikipedia dumps]] should be downloaded in advance. | ||
| + | |||
| + | ==== Plain Text Extraction ==== | ||
| + | Please refer to [[https://radimrehurek.com/gensim/scripts/segment_wiki.html|gensim.scripts.segment_wiki]] for more details. | ||
| + | |||
| + | <code bash> | ||
| + | > python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p10p1147431.xml.bz2 -o enwiki-2026-03-01-p10p1147431.json.gz | ||
| + | > python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p1147434p3987701.xml.bz2 -o enwiki-2026-03-01-p1147434p3987701.json.gz | ||
| + | > python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p3987703p8213792.xml.bz2 -o enwiki-2026-03-01-p3987703p8213792.json.gz | ||
| + | > python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p8213793p13295371.xml.bz2 -o enwiki-2026-03-01-p8213793p13295371.json.gz | ||
| + | > python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p13295373p18816201.xml.bz2 -o enwiki-2026-03-01-p13295373p18816201.json.gz | ||
| + | > python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p18816202p24038461.xml.bz2 -o enwiki-2026-03-01-p18816202p24038461.json.gz | ||
| + | > python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p24038462p29075629.xml.bz2 -o enwiki-2026-03-01-p24038462p29075629.json.gz | ||
| + | > python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p29075630p34204620.xml.bz2 -o enwiki-2026-03-01-p29075630p34204620.json.gz | ||
| + | > python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p34204621p39293698.xml.bz2 -o enwiki-2026-03-01-p34204621p39293698.json.gz | ||
| + | > python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p39293699p43920660.xml.bz2 -o enwiki-2026-03-01-p39293699p43920660.json.gz | ||
| + | > python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p43920661p48725620.xml.bz2 -o enwiki-2026-03-01-p43920661p48725620.json.gz | ||
| + | |||
| + | </code> | ||
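The segmented files can then be read back one article at a time. The following is a minimal sketch, assuming the JSON-lines layout described in the gensim documentation (one JSON object per line with `title`, `section_titles`, and `section_texts`); the helper name `iter_article_texts` is hypothetical and not part of gensim.

```python
import gzip
import json
from typing import Iterator


def iter_article_texts(path: str) -> Iterator[str]:
    """Yield the plain text of each article in a segment_wiki output file.

    Hypothetical helper: assumes one JSON object per line with a
    "section_texts" list, as documented for gensim.scripts.segment_wiki.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines defensively
            article = json.loads(line)
            # Join the section bodies into a single string per article.
            yield "\n".join(article["section_texts"])
```

For example, `for text in iter_article_texts("enwiki-2026-03-01-p10p1147431.json.gz"): ...` streams the first segmented file without loading it fully into memory.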
| + | |||
| + | ==== One- and Two-Gram Counting ==== | ||
| + | |||
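One possible way to count one- and two-grams and derive PMI from them is sketched below. This page does not specify the tokenizer, the co-occurrence window, or the logarithm base, so the lowercase alphanumeric tokenizer, adjacent-token bigrams, and base-2 logarithm used here are all assumptions, and the names `count_ngrams` and `pmi` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Iterable, Tuple

# Assumed tokenizer: lowercase runs of ASCII letters and digits.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def count_ngrams(texts: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count unigrams and adjacent-token bigrams over an iterable of texts."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for text in texts:
        tokens = TOKEN_PATTERN.findall(text.lower())
        unigrams.update(tokens)
        # Bigrams never span two texts because each text is tokenized alone.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def pmi(first: str, second: str, unigrams: Counter, bigrams: Counter) -> float:
    """Return log2(p(first, second) / (p(first) * p(second)))."""
    pair_count = bigrams[(first, second)]
    if pair_count == 0:
        return float("-inf")  # the pair was never observed
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    p_pair = pair_count / total_bigrams
    p_first = unigrams[first] / total_unigrams
    p_second = unigrams[second] / total_unigrams
    return math.log2(p_pair / (p_first * p_second))
```

For the full dump, recomputing the totals on every `pmi` call is wasteful; caching them once after counting would be the natural next step.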