Emerging Research Topics Detection

Citation Information

Shuo Xu, Liyuan Hao, Xin An, Guancan Yang, and Feifei Wang, 2019. Emerging Research Topics Detection with Multiple Machine Learning Models. Journal of Informetrics, Vol. 13, No. 4, pp. 100983.

Requirements

Dataset

  • Search strategy
    • TS = (gene edit*) OR TS = (crispr)
    • Language: English
    • Document type: (Article OR Book Review OR Proceedings Paper OR Review)
    • Timespan: 1980-2017
  • Import CSV files into MySQL database with CSVImporter.java, and then delete a duplicate record as follows.
DELETE FROM article WHERE doi = "10.3389/FIMMU.2017.00351";
DELETE FROM article_reference WHERE article_id = "WOS:000398414900001"; 
 
ALTER TABLE article ADD UNIQUE INDEX doi (doi);
  • Manually correct the DOIs of the references whose records contain two or more DOIs. There are 1098 such records in total. The corrected DOIs are then imported into the MySQL database with DoiCorrector.java.
SELECT COUNT(*) FROM reference WHERE flag = 1;
 
DELETE FROM article_reference WHERE reference_id = 391302 OR reference_id = 324471;
DELETE FROM reference WHERE id = 391302 OR id = 324471;
 
-- Multiple records with the same doi.
SELECT * FROM reference AS r WHERE r.doi IN (SELECT doi FROM reference GROUP BY doi HAVING COUNT(*) > 1);
 
SELECT * FROM reference WHERE doi = "10.1056/NEJMOA1606774";  --id = 7625, 8699
UPDATE article_reference SET reference_id = 8699 WHERE reference_id = 7625;
DELETE FROM reference WHERE id = 7625;
 
SELECT * FROM reference WHERE doi = "10.1093/NAR/GKW883";  --id = 4025, 5595
UPDATE article_reference SET reference_id = 4025 WHERE reference_id = 5595;
DELETE FROM reference WHERE id = 5595;
 
SELECT * FROM reference WHERE doi = "10.1111/PBI.12444";  --id = 28450, 28544
UPDATE article_reference SET reference_id = 28450 WHERE reference_id = 28544;
DELETE FROM reference WHERE id = 28544;
 
SELECT * FROM reference WHERE doi = "10.1111/PBI.12448";  --id = 8812, 28542
UPDATE article_reference SET reference_id = 8812 WHERE reference_id = 28542;
DELETE FROM reference WHERE id = 28542;
 
SELECT * FROM reference WHERE doi = "10.1182/BLOOD-2012-02-408591";  --id = 41631, 49144
UPDATE article_reference SET reference_id = 49144 WHERE reference_id = 41631;
DELETE FROM reference WHERE id = 41631;
 
SELECT * FROM reference WHERE doi = "10.1016/J.JHEP.2011.04.007";  --id = 143283, 143294
UPDATE reference SET doi = UPPER("10.1016/j.jhep.2011.01.048") WHERE id = 143283;
 
SELECT * FROM reference WHERE doi = "10.1093/NAR/28.5.1092";  --id = 155644, 172639
UPDATE article_reference SET reference_id = 172639 WHERE reference_id = 155644; 
DELETE FROM reference WHERE id = 155644;
 
SELECT * FROM reference WHERE doi = "10.1093/NAR/GKP846";  --id = 215857, 218559
UPDATE article_reference SET reference_id = 215857 WHERE reference_id = 218559; 
DELETE FROM reference WHERE id = 218559; 
 
SELECT * FROM reference WHERE doi = "10.1038/NCOMMS4832";  --id = 31633, 195611
UPDATE article_reference SET reference_id = 31633 WHERE reference_id = 195611; 
DELETE FROM reference WHERE id = 195611; 
 
SELECT * FROM reference WHERE doi = "10.1182/ASHEDUCATION-2009.1.682";  --id = 15976, 222652
UPDATE article_reference SET reference_id = 15976 WHERE reference_id = 222652; 
DELETE FROM reference WHERE id = 222652; 
 
SELECT * FROM reference WHERE doi = "10.1021/ACSSYNBIO.5B00074";  --id = 121810, 325434
UPDATE article_reference SET reference_id = 325434 WHERE reference_id = 121810;
DELETE FROM reference WHERE id = 121810; 
 
SELECT * FROM reference WHERE doi = "10.1002/0471142727.MB0116S78";  --id = 16600, 329023
UPDATE reference SET doi = UPPER("10.1002/0471142727.mb0117s79") WHERE id = 16600;
 
SELECT * FROM reference WHERE doi = "10.1002/DVG.22835";  --id = 32722, 119752
UPDATE article_reference SET reference_id = 32722 WHERE reference_id = 119752; 
DELETE FROM reference WHERE id = 119752; 
 
SELECT * FROM reference WHERE doi = "10.1038/NATURE13864";  --id = 13415, 119740
UPDATE article_reference SET reference_id = 13415 WHERE reference_id = 119740; 
DELETE FROM reference WHERE id = 119740; 
 
SELECT * FROM reference WHERE doi = "10.1038/NBT.3081";  --id = 3655, 119758
UPDATE article_reference SET reference_id = 3655 WHERE reference_id = 119758; 
DELETE FROM reference WHERE id = 119758; 
 
SELECT * FROM reference WHERE doi = "10.1038/NBT.3101";  --id = 3853, 119743
UPDATE article_reference SET reference_id = 3853 WHERE reference_id = 119743; 
DELETE FROM reference WHERE id = 119743; 
 
SELECT * FROM reference WHERE doi = "10.1093/BIOINFORMATICS/BTU743";  -- id = 893, 115121
UPDATE article_reference SET reference_id = 893 WHERE reference_id = 115121; 
DELETE FROM reference WHERE id = 115121; 
 
SELECT * FROM reference WHERE doi = "10.1093/PCP/PCX034";  --id = 19220, 351788
UPDATE article_reference SET reference_id = 19220 WHERE reference_id = 351788; 
DELETE FROM reference WHERE id = 351788; 
 
SELECT * FROM reference WHERE doi = "10.1111/PBI.12663";  --id = 14090, 351787
UPDATE article_reference SET reference_id = 14090 WHERE reference_id = 351787; 
DELETE FROM reference WHERE id = 351787; 
 
SELECT * FROM reference WHERE doi = "10.1111/TPJ.12338";  --id = 15135, 143729
UPDATE article_reference SET reference_id = 15135 WHERE reference_id = 143729; 
DELETE FROM reference WHERE id = 143729; 
 
SELECT * FROM reference WHERE doi = "10.1128/JVI.01879-14";  --id = 3749, 119745
UPDATE article_reference SET reference_id = 3749 WHERE reference_id = 119745;
DELETE FROM reference WHERE id = 119745; 
 
SELECT * FROM reference WHERE doi = "10.1146/ANNUREV-PHYTO-080508-081936";  --id = 30366, 144245
UPDATE article_reference SET reference_id = 30366 WHERE reference_id = 144245; 
DELETE FROM reference WHERE id = 144245; 
 
SELECT * FROM reference WHERE doi = "10.1182/BLOOD-2010-09-309591";  --id = 13744, 116491
UPDATE article_reference SET reference_id = 13744 WHERE reference_id = 116491; 
DELETE FROM reference WHERE id = 116491; 
 
SELECT * FROM reference WHERE doi = "10.1186/S13059-014-0554-4";  --id = 1138, 115027
UPDATE article_reference SET reference_id = 1138 WHERE reference_id = 115027; 
DELETE FROM reference WHERE id = 115027; 
 
SELECT * FROM reference WHERE doi = "10.1016/J.STR.2013.01.010";  --id = 47298, 164488
UPDATE article_reference SET reference_id = 47298 WHERE reference_id = 164488;
DELETE FROM reference WHERE id = 164488;
 
SELECT * FROM reference WHERE doi = "10.4049/JIMMUNOL.0801143";  --id = 72385, 389538
UPDATE article_reference SET reference_id = 72385 WHERE reference_id = 389538;
DELETE FROM reference WHERE id = 389538;
 
SELECT * FROM reference WHERE doi = "10.1895/WORMBOOK.1.89.1";  --id = 99739, 401115
UPDATE article_reference SET reference_id = 401115 WHERE reference_id = 99739;
DELETE FROM reference WHERE id = 99739;
 
SELECT * FROM reference WHERE doi = "10.1002/0471142727.MB0117S79";  --id = 16600, 58341
UPDATE article_reference SET reference_id = 16600 WHERE reference_id = 58341;
DELETE FROM reference WHERE id = 58341;
 
SELECT * FROM reference WHERE doi = "10.1002/0471250953.BI0206S21";  --id = 336465, 370927
UPDATE article_reference SET reference_id = 336465 WHERE reference_id = 370927;
DELETE FROM reference WHERE id = 370927;
 
SELECT * FROM reference WHERE doi = "10.1002/BIOT.201400046";  --id = 3644, 65728
UPDATE article_reference SET reference_id = 3644 WHERE reference_id = 65728;
DELETE FROM reference WHERE id = 65728;
 
SELECT * FROM reference WHERE doi = "10.1016/J.ACA.2015.11.023";  --id = 342693, 392793
UPDATE article_reference SET reference_id = 392793 WHERE reference_id = 342693;
DELETE FROM reference WHERE id = 342693;
 
SELECT * FROM reference WHERE doi = "10.1016/J.JMB.2013.02.032";  --id = 10972, 162734
UPDATE article_reference SET reference_id = 10972 WHERE reference_id = 162734;
DELETE FROM reference WHERE id = 162734;
 
SELECT * FROM reference WHERE doi = "10.1038/NBT1418";  --id = 64996, 345646
UPDATE article_reference SET reference_id = 64996 WHERE reference_id = 345646;
DELETE FROM reference WHERE id = 345646;
 
SELECT * FROM reference WHERE doi = "10.1038/ONC.2015.469";  --id = 4766, 52235
UPDATE article_reference SET reference_id = 4766 WHERE reference_id = 52235;
DELETE FROM reference WHERE id = 52235;
 
SELECT * FROM reference WHERE doi = "10.1038/ONC.2015.477";  --id = 352302, 385213
UPDATE article_reference SET reference_id = 385213 WHERE reference_id = 352302;
DELETE FROM reference WHERE id = 352302;
 
SELECT * FROM reference WHERE doi = "10.1038/SREP05396";  --id = 14281, 46957
UPDATE article_reference SET reference_id = 14281 WHERE reference_id = 46957;
DELETE FROM reference WHERE id = 46957;
 
SELECT * FROM reference WHERE doi = "10.1073/PNAS.1121465109";  --id = 1244, 155642
UPDATE article_reference SET reference_id = 1244 WHERE reference_id = 155642;
DELETE FROM reference WHERE id = 155642;
 
SELECT * FROM reference WHERE doi = "10.1093/BIOINFORMATICS/BTU153";  --id = 2706, 357942
UPDATE article_reference SET reference_id = 2706 WHERE reference_id = 357942;
DELETE FROM reference WHERE id = 357942;
 
SELECT * FROM reference WHERE doi = "10.1093/NAR/GKU1130";  --id = 66224, 117403
UPDATE article_reference SET reference_id = 117403 WHERE reference_id = 66224;
DELETE FROM reference WHERE id = 66224;
 
SELECT * FROM reference WHERE doi = "10.1093/NAR/GKU1326";  --id = 57023, 119747
UPDATE article_reference SET reference_id = 57023 WHERE reference_id = 119747;
DELETE FROM reference WHERE id = 119747;
 
SELECT * FROM reference WHERE doi = "10.1101/GR.229202";  --id = 2184, 75153
UPDATE article_reference SET reference_id = 2184 WHERE reference_id = 75153;
DELETE FROM reference WHERE id = 75153;
 
SELECT * FROM reference WHERE doi = "10.1111/JNC.12198";  --id = 120710, 162241
UPDATE article_reference SET reference_id = 120710 WHERE reference_id = 162241;
DELETE FROM reference WHERE id = 162241;
 
SELECT * FROM reference WHERE doi = "10.1126/science.1258699";  --id = 63031, 114526
UPDATE article_reference SET reference_id = 63031 WHERE reference_id = 114526;
DELETE FROM reference WHERE id = 114526;
 
SELECT * FROM reference WHERE doi = "10.1128/mbio.01068-14";  --id = 6832, 364420
UPDATE article_reference SET reference_id = 6832 WHERE reference_id = 364420;
DELETE FROM reference WHERE id = 364420;
 
SELECT * FROM reference WHERE doi = "10.1146/ANNUREV-GENOM-091212-153435";  --id = 11949, 163158
UPDATE article_reference SET reference_id = 11949 WHERE reference_id = 163158;
DELETE FROM reference WHERE id = 163158;
 
SELECT * FROM reference WHERE doi = "10.1146/ANNUREV-MICRO-092412-155633";  --id = 9911, 38661
UPDATE article_reference SET reference_id = 9911 WHERE reference_id = 38661;
DELETE FROM reference WHERE id = 38661;
 
SELECT * FROM reference WHERE doi = "10.1186/1471-2105-15-293";  --id = 1350, 188399
UPDATE article_reference SET reference_id = 188399 WHERE reference_id = 1350;
DELETE FROM reference WHERE id = 1350;
 
SELECT * FROM reference WHERE doi = "10.1186/1471-2105-9-11";  --id = 22829, 390101
UPDATE article_reference SET reference_id = 22829 WHERE reference_id = 390101;
DELETE FROM reference WHERE id = 390101;
 
SELECT * FROM reference WHERE doi = "10.1146/ANNUREV.BIOCHEM.77.061306.125255";  --id = 951, 110621
UPDATE article_reference SET reference_id = 110621 WHERE reference_id = 951;
DELETE FROM reference WHERE id = 951;
 
SELECT * FROM reference WHERE doi = "10.1038/NRMICRO3569";  --id = 1039, 85667
UPDATE article_reference SET reference_id = 1039 WHERE reference_id = 85667;
DELETE FROM reference WHERE id = 85667;
 
SELECT * FROM reference WHERE doi = "10.1038/MT.2016.1";  --id = 1732, 408166
UPDATE article_reference SET reference_id = 1732 WHERE reference_id = 408166;
DELETE FROM reference WHERE id = 408166;
 
SELECT * FROM reference WHERE doi = "10.1371/JOURNAL.PONE.0047232";  --id = 3410, 51811
UPDATE article_reference SET reference_id = 3410 WHERE reference_id = 51811;
DELETE FROM reference WHERE id = 51811;
 
SELECT * FROM reference WHERE doi = "10.1021/ACSSYNBIO.5B00007";   --id = 4468, 79259
UPDATE article_reference SET reference_id = 4468 WHERE reference_id = 79259;
DELETE FROM reference WHERE id = 79259;
 
SELECT * FROM reference WHERE doi = "10.1371/JOURNAL.PONE.0124633";   --id = 5492, 414191
UPDATE article_reference SET reference_id = 5492 WHERE reference_id = 414191;
DELETE FROM reference WHERE id = 414191;
 
SELECT * FROM reference WHERE doi = "10.1038/ONC.2014.470";   --id = 7402, 94276
UPDATE article_reference SET reference_id = 7402 WHERE reference_id = 94276;
DELETE FROM reference WHERE id = 94276;
 
SELECT * FROM reference WHERE doi = "10.1146/ANNUREV.BIOCHEM.052308.093131";   --id = 7920, 81099
UPDATE article_reference SET reference_id = 7920 WHERE reference_id = 81099;
DELETE FROM reference WHERE id = 81099;
 
SELECT * FROM reference WHERE doi = "10.1016/J.MOLP.2015.02.011";   --id = 19210, 90506
UPDATE article_reference SET reference_id = 19210 WHERE reference_id = 90506;
DELETE FROM reference WHERE id = 90506;
 
SELECT * FROM reference WHERE doi = "10.1016/J.MOLP.2015.04.001";   --id = 28195, 98913
UPDATE article_reference SET reference_id = 28195 WHERE reference_id = 98913;
DELETE FROM reference WHERE id = 98913;
 
SELECT * FROM reference WHERE doi = "10.1111/PBI.12483";   --id = 28250, 414661
UPDATE article_reference SET reference_id = 28250 WHERE reference_id = 414661;
DELETE FROM reference WHERE id = 414661;
 
SELECT * FROM reference WHERE doi = "10.1111/PBI.12459";   --id = 28258, 414662
UPDATE article_reference SET reference_id = 28258 WHERE reference_id = 414662;
DELETE FROM reference WHERE id = 414662;
 
SELECT * FROM reference WHERE doi = "10.1101/PDB.PROT4668";   --id = 28554, 99811
DELETE FROM article_reference WHERE reference_id = 99811; 
UPDATE article_reference SET reference_id = 99811 WHERE reference_id = 28554;
DELETE FROM reference WHERE id = 28554;
 
SELECT * FROM reference WHERE doi = "10.1038/NCOMMS3503";   --id = 32801, 87241
UPDATE article_reference SET reference_id = 32801 WHERE reference_id = 87241;
DELETE FROM reference WHERE id = 87241;
 
SELECT * FROM reference WHERE doi = "10.1016/J.MOLP.2015.02.012";   --id = 45207, 98915
UPDATE article_reference SET reference_id = 45207 WHERE reference_id = 98915;
DELETE FROM reference WHERE id = 98915;
 
SELECT * FROM reference WHERE doi = "10.1002/0471143030.CB1912S44";   --id = 63782, 90082
UPDATE article_reference SET reference_id = 90082 WHERE reference_id = 63782;
DELETE FROM reference WHERE id = 63782;
 
SELECT * FROM reference WHERE doi = "10.1002/JOR.22745";   --id = 73362, 101588
UPDATE article_reference SET reference_id = 73362 WHERE reference_id = 101588;
DELETE FROM reference WHERE id = 101588;
 
SELECT * FROM reference WHERE doi = "10.1128/MBIO.00869-14";   --id = 84751, 353843
UPDATE article_reference SET reference_id = 353843 WHERE reference_id = 84751;
DELETE FROM reference WHERE id = 84751;
 
SELECT * FROM reference WHERE doi = "10.1126/SCIENCE.282.5396.2012";   --id = 110922, 383865
UPDATE article_reference SET reference_id = 110922 WHERE reference_id = 383865;
DELETE FROM reference WHERE id = 383865;
 
SELECT * FROM reference WHERE doi = "10.1186/1471-2180-11-12";   --id = 143645, 178288
UPDATE reference SET doi = UPPER("10.1186/1471-2180-11-102") WHERE id = 143645;
 
SELECT * FROM reference WHERE doi = "10.1186/1471-2180-11-102";   --id = 68810, 143645
UPDATE article_reference SET reference_id = 143645 WHERE reference_id = 68810;
DELETE FROM reference WHERE id = 68810;
 
SELECT * FROM reference WHERE doi = "10.1105/TPC.108.062018";   --id = 205882, 412957
UPDATE article_reference SET reference_id = 412957 WHERE reference_id = 205882;
DELETE FROM reference WHERE id = 205882;
 
SELECT * FROM reference WHERE doi = "10.3389/FCIMB.2016.00031";   --id = 319969, 409430
UPDATE article_reference SET reference_id = 409430 WHERE reference_id = 319969;
DELETE FROM reference WHERE id = 319969;
 
UPDATE article_reference SET reference_id = 14562 WHERE reference_id = 174753;
DELETE FROM reference WHERE id = 174753; 
 
UPDATE article_reference SET reference_id = 128110 WHERE reference_id = 141709; 
DELETE FROM reference WHERE id = 141709; 
 
UPDATE article_reference SET reference_id = 196332 WHERE reference_id = 85294; 
DELETE FROM reference WHERE id = 85294; 
 
UPDATE article_reference SET reference_id = 18740 WHERE reference_id = 174995;
DELETE FROM reference WHERE id = 174995;
 
UPDATE article_reference SET reference_id = 106976 WHERE reference_id = 35109; 
DELETE FROM reference WHERE id = 35109;
 
UPDATE article_reference SET reference_id = 218880 WHERE reference_id = 36325; 
DELETE FROM reference WHERE id = 36325;
 
UPDATE article_reference SET reference_id = 384149 WHERE reference_id = 152320; 
DELETE FROM reference WHERE id = 152320;
 
UPDATE article_reference SET reference_id = 128443 WHERE reference_id = 143146; 
DELETE FROM reference WHERE id = 143146;
 
UPDATE article_reference SET reference_id = 171985 WHERE reference_id = 139223; 
DELETE FROM reference WHERE id = 139223;
 
UPDATE article_reference SET reference_id = 121244 WHERE reference_id = 98415; 
DELETE FROM reference WHERE id = 98415;
 
UPDATE article_reference SET reference_id = 186872 WHERE reference_id = 81683; 
DELETE FROM reference WHERE id = 81683;
 
UPDATE article_reference SET reference_id = 346372 WHERE reference_id = 375437; 
DELETE FROM reference WHERE id = 375437;
 
UPDATE article_reference SET reference_id = 83423 WHERE reference_id = 74597; 
DELETE FROM reference WHERE id = 74597;
 
UPDATE reference SET doi = "10.4172/2155-9899.1000264" WHERE id = 101055; 
 
DELETE FROM article_reference WHERE reference_id = 151115 OR reference_id = 262589 OR reference_id = 310078;
DELETE FROM reference WHERE id = 151115 OR id = 262589 OR id = 310078;
 
ALTER TABLE reference ADD UNIQUE INDEX doi (doi); 
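The SELECT / UPDATE / DELETE triples above all follow one pattern: find the rows sharing a DOI, re-point the citing links to the row that is kept, and delete the other row. A minimal sketch of that pattern follows, using Python's built-in sqlite3 module on an in-memory database so that it runs without a MySQL server. It is not the procedure actually used for these notes: the two-column table layout is an assumption based only on the columns named above, and the rule of always keeping the smallest id is a simplification (the manual statements sometimes keep the larger id). It also cannot handle cases such as 10.1016/J.JHEP.2011.04.007, where one of the rows carries a wrong DOI that has to be corrected by hand.

```python
import sqlite3


def merge_duplicate_dois(conn):
    """Merge reference rows that share a DOI; return how many rows were removed.

    Assumption: the row with the smallest id is kept. The manual statements
    in these notes sometimes keep the larger id instead.
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT UPPER(doi), MIN(id) FROM reference "
        "WHERE doi IS NOT NULL GROUP BY UPPER(doi) HAVING COUNT(*) > 1"
    )
    removed = 0
    for doi, keep_id in cur.fetchall():
        cur.execute(
            "SELECT id FROM reference WHERE UPPER(doi) = ? AND id <> ?",
            (doi, keep_id),
        )
        for (dup_id,) in cur.fetchall():
            # Drop links that would become exact duplicates after re-pointing.
            cur.execute(
                "DELETE FROM article_reference WHERE reference_id = ? "
                "AND article_id IN (SELECT article_id FROM article_reference "
                "WHERE reference_id = ?)",
                (dup_id, keep_id),
            )
            # Re-point the remaining links to the kept row, then drop the duplicate.
            cur.execute(
                "UPDATE article_reference SET reference_id = ? "
                "WHERE reference_id = ?",
                (keep_id, dup_id),
            )
            cur.execute("DELETE FROM reference WHERE id = ?", (dup_id,))
            removed += 1
    conn.commit()
    return removed


def build_demo_db():
    """Create a tiny in-memory database with one duplicated DOI (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE reference (id INTEGER PRIMARY KEY, doi TEXT);
        CREATE TABLE article_reference (article_id TEXT, reference_id INTEGER);
        INSERT INTO reference VALUES (7625, '10.1056/NEJMOA1606774');
        INSERT INTO reference VALUES (8699, '10.1056/NEJMOA1606774');
        INSERT INTO reference VALUES (4025, '10.1093/NAR/GKW883');
        INSERT INTO article_reference VALUES ('A1', 7625);
        INSERT INTO article_reference VALUES ('A2', 8699);
        INSERT INTO article_reference VALUES ('A3', 4025);
        """
    )
    return conn


if __name__ == "__main__":
    demo_conn = build_demo_db()
    print(merge_duplicate_dois(demo_conn), "duplicate row(s) merged")
```

Reviewing each pair by hand, as done above, is still the safer route when the two rows may describe different works.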
  • Extract the open access XML file names from the reference table, oa_file_list.csv and PMC-ids.csv.gz with DownloadConverter.java, and then download the files automatically as follows.
wget -b -i download.txt
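The source of DownloadConverter.java is not shown here, so the following is only a sketch of how a download.txt could be assembled: look up each reference DOI in the DOI-to-PMCID mapping, then look up the PMCID in the open access file list. The column names (DOI, PMCID, Accession ID, File), the sample rows, and the base URL are all assumptions made for illustration and should be checked against the real files.

```python
import csv
import io


def build_download_list(dois, pmc_ids_rows, oa_rows, base_url):
    """Return one URL per DOI that has an entry in the open access file list.

    Assumed column names: DOI and PMCID in the id mapping,
    Accession ID and File in the file list.
    """
    doi_to_pmcid = {
        row["DOI"].upper(): row["PMCID"] for row in pmc_ids_rows if row["DOI"]
    }
    pmcid_to_file = {row["Accession ID"]: row["File"] for row in oa_rows}
    urls = []
    for doi in dois:
        pmcid = doi_to_pmcid.get(doi.upper())
        path = pmcid_to_file.get(pmcid)
        if path:  # skip DOIs without an open access file
            urls.append(base_url.rstrip("/") + "/" + path)
    return urls


def demo():
    """Run the lookup on made-up rows; none of these values are real records."""
    pmc_ids = io.StringIO(
        "DOI,PMCID\n"
        "10.1000/example.1,PMC0000001\n"
        "10.1000/example.2,PMC0000002\n"
    )
    oa_list = io.StringIO(
        "File,Accession ID\n"
        "oa_package/aa/bb/PMC0000001.tar.gz,PMC0000001\n"
    )
    return build_download_list(
        ["10.1000/EXAMPLE.1", "10.1000/EXAMPLE.2", "10.1000/EXAMPLE.3"],
        list(csv.DictReader(pmc_ids)),
        list(csv.DictReader(oa_list)),
        "https://example.org/pmc/",  # placeholder base URL
    )


if __name__ == "__main__":
    print("\n".join(demo()))
```

Writing the returned URLs one per line gives a file in the form that wget -i expects.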
  • Extract the titles and abstracts from the downloaded XML files, and save them to MySQL database with XMLExtractor.java.
  • BibTeXConverter.java
  • Generate a unique wos_id:
UPDATE reference SET wos_id = CONCAT('SXU:', LPAD(id, 15, '0')) WHERE wos_id IS NULL;
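The generated identifiers have the same shape as a Web of Science accession number such as WOS:000398414900001, that is, a prefix followed by 15 digits. The small sketch below mirrors the CONCAT / LPAD expression in Python to show what the UPDATE produces; the function name and parameters are hypothetical.

```python
def make_wos_id(ref_id: int, prefix: str = "SXU:", width: int = 15) -> str:
    """Mimic CONCAT('SXU:', LPAD(id, 15, '0')) for a non-negative integer id."""
    digits = str(ref_id)
    # MySQL's LPAD pads on the left and truncates to `width` characters
    # when the input is already longer, keeping the leftmost characters.
    padded = digits.rjust(width, "0")[:width]
    return prefix + padded


if __name__ == "__main__":
    for ref_id in (893, 7625, 414662):
        print(ref_id, make_wos_id(ref_id))
```

Because the reference ids in these notes have at most six digits, the truncation branch never applies here and the generated ids stay unique.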

Detect and Tokenize Sentences, and Recognize Entities

Run Converter2Genia.java in the package cn.edu.bjut.genia of the project EmergingTopicsConverter. The articles are then saved in the directories data/contest-Genia/DIM and data/contest-Genia/CIM, with each article named by its id.

> ./run_geniass.sh geniass data/contest-Genia/DIM &
> ./run_geniatagger.sh geniatagger data/contest-Genia/DIM &
> ./run_geniass.sh geniass data/contest-Genia/CIM &
> ./run_geniatagger.sh geniatagger data/contest-Genia/CIM &

For each document, two files are generated with the extensions .txt.ss and .txt.ss.tag. Keep all .txt.ss and .txt.ss.tag files in the directories data/contest-Genia/DIM and data/contest-Genia/CIM.
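Since the four jobs run in the background, it can be useful to check afterwards that every document received both output files. The sketch below lists the missing ones. It assumes that each input document is stored as a .txt file, which is inferred from the .txt.ss and .txt.ss.tag extensions; the function names are hypothetical.

```python
import tempfile
from pathlib import Path


def missing_genia_outputs(directory):
    """Return the names of expected .txt.ss / .txt.ss.tag files that do not exist."""
    directory = Path(directory)
    missing = []
    # "*.txt" matches only the input documents, not the .txt.ss outputs.
    for txt in sorted(directory.glob("*.txt")):
        for suffix in (".ss", ".ss.tag"):
            if not Path(str(txt) + suffix).exists():
                missing.append(txt.name + suffix)
    return missing


def demo():
    """Build a throwaway directory where document 2 lacks its .txt.ss.tag file."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("1.txt", "1.txt.ss", "1.txt.ss.tag", "2.txt", "2.txt.ss"):
            (Path(tmp) / name).write_text("", encoding="utf-8")
        return missing_genia_outputs(tmp)


if __name__ == "__main__":
    print(demo())
```

In practice the function would be called once for data/contest-Genia/DIM and once for data/contest-Genia/CIM.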

Run the DIM Model

Run Converter2DIM.java in the package cn.edu.bjut.genia of the project EmergingTopicsConverter. Several files are generated for the DIM model in the directory data/contest-DIM/emergence.

Format Conversion

> java -jar BibTeXConverter.jar -inputDir ../data/bibtex -outputDir ../data -prefix LIS
> python convert_to_dtm.py --input_dir ../data/CorpusByYear/ --preprocessing wordlen --min_df 5 --output_dir ../data --prefix LIS
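The DTM program reads its corpus from files named after the prefix: a -mult.dat file with one line per document of the form "unique_word_count index:count ...", ordered by time slice, and a -seq.dat file holding the number of time slices followed by the number of documents in each slice. The sketch below builds such lines from tokenized documents grouped by year. It is not the code of convert_to_dtm.py, which is not shown here; in particular, treating min_df as a minimum document frequency is an assumption about that option, and the function name and the sample data are hypothetical.

```python
from collections import Counter


def to_dtm_lines(docs_by_year, min_df=1):
    """Convert {year: [token lists]} into (vocab, mult_lines, seq_lines)."""
    years = sorted(docs_by_year)
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for year in years:
        for tokens in docs_by_year[year]:
            df.update(set(tokens))
    vocab = sorted(w for w, n in df.items() if n >= min_df)
    index = {w: i for i, w in enumerate(vocab)}

    mult_lines = []
    seq_lines = [str(len(years))]  # first line: number of time slices
    for year in years:
        kept = 0
        for tokens in docs_by_year[year]:
            counts = Counter(index[t] for t in tokens if t in index)
            if not counts:
                continue  # skip documents emptied by the vocabulary filter
            pairs = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
            mult_lines.append(f"{len(counts)} {pairs}")
            kept += 1
        seq_lines.append(str(kept))  # documents in this time slice
    return vocab, mult_lines, seq_lines


if __name__ == "__main__":
    sample = {
        2016: [["crispr", "cas9", "crispr"]],
        2017: [["cas9", "gene"], ["gene", "editing"]],
    }
    vocab, mult, seq = to_dtm_lines(sample, min_df=2)
    print(vocab)
    print("\n".join(mult))
    print("\n".join(seq))
```

With the prefix LIS used above, the two line lists would correspond to LIS-mult.dat and LIS-seq.dat in the output directory.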

DTM & DIM

> ./main --ntopics=20 --mode=fit --rng_seed=0 --initialize_lda=true --corpus_prefix=data/LIS --outname=data/dtm-run --top_chain_var=0.005 --alpha=0.01 --lda_sequence_min_iter=6 --lda_sequence_max_iter=20 --lda_max_em_iter=10
> ./main --ntopics=20 --mode=fit --rng_seed=0 --initialize_lda=true --corpus_prefix=data/LIS --outname=data/dim-run --top_chain_var=0.005 --model=fixed --time_resolution=2 --influence_flat_years=5 --top_obs_var=0.5 --sigma_d=0.0001 --sigma_l=0.0001 --alpha=0.01 --lda_sequence_min_iter=6 --lda_sequence_max_iter=20 --save_time=-1 --lda_max_em_iter=10
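The two commands share most of their options; the DIM run adds the influence-related ones and writes to a different output name. The sketch below assembles both argument lists from one shared dictionary so that the difference is easy to see. All option names and values are copied from the two commands above; the Python names are hypothetical, and the sketch only builds the lists without starting the program.

```python
# Options common to both runs, copied from the commands in these notes.
COMMON = {
    "ntopics": "20",
    "mode": "fit",
    "rng_seed": "0",
    "initialize_lda": "true",
    "corpus_prefix": "data/LIS",
    "top_chain_var": "0.005",
    "alpha": "0.01",
    "lda_sequence_min_iter": "6",
    "lda_sequence_max_iter": "20",
    "lda_max_em_iter": "10",
}

DTM_FLAGS = {**COMMON, "outname": "data/dtm-run"}

# The DIM run adds the options below on top of the shared ones.
DIM_FLAGS = {
    **COMMON,
    "outname": "data/dim-run",
    "model": "fixed",
    "time_resolution": "2",
    "influence_flat_years": "5",
    "top_obs_var": "0.5",
    "sigma_d": "0.0001",
    "sigma_l": "0.0001",
    "save_time": "-1",
}


def build_args(binary, flags):
    """Turn a flag dictionary into an argument list such as ['./main', '--ntopics=20', ...]."""
    return [binary] + [f"--{name}={value}" for name, value in flags.items()]


if __name__ == "__main__":
    print(" ".join(build_args("./main", DTM_FLAGS)))
    print(" ".join(build_args("./main", DIM_FLAGS)))
    print("DIM-only options:", sorted(set(DIM_FLAGS) - set(DTM_FLAGS)))
```

The resulting lists can be passed to subprocess.run once the DTM binary has been built.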
zh/notes/emerging_topics_detection.txt · Last modified: 2022/01/28 09:12 by pzczxs