用户工具

站点工具


zh:notes:math_softwares

Math Software Dataset Construction

Citation Information

Create Database

The database SQL file: math_software.sql.

Download Data

Word Embedding

PMI Calculation

Wikepedia dumps should be downloaded in advance.

Plain Text Extraction

Please refer to gensim.scripts.segment for more detial.

> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p10p1147431.xml.bz2 -o enwiki-2026-03-01-p10p1147431.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p1147434p3987701.xml.bz2 -o enwiki-2026-03-01-p1147434p3987701.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p3987703p8213792.xml.bz2 -o enwiki-2026-03-01-p3987703p8213792.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p8213793p13295371.xml.bz2 -o enwiki-2026-03-01-p8213793p13295371.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p13295373p18816201.xml.bz2 -o enwiki-2026-03-01-p13295373p18816201.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p18816202p24038461.xml.bz2 -o enwiki-2026-03-01-p18816202p24038461.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p24038462p29075629.xml.bz2 -o enwiki-2026-03-01-p24038462p29075629.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p29075630p34204620.xml.bz2 -o enwiki-2026-03-01-p29075630p34204620.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p34204621p39293698.xml.bz2 -o enwiki-2026-03-01-p34204621p39293698.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p39293699p43920660.xml.bz2 -o enwiki-2026-03-01-p39293699p43920660.json.gz
> python -m gensim.scripts.segment_wiki -i -f enwiki-2026-03-01-p43920661p48725620.xml.bz2 -o enwiki-2026-03-01-p43920661p48725620.json.gz

One- and Two-Gram Counting

zh/notes/math_softwares.txt · 最后更改: 2026/03/28 21:47 由 pzczxs