Commonality and Specialty Detection

Citation Information

Shuo Xu, Ling Li, Xin An, Liyuan Hao, and Guancan Yang, 2021. An Approach for Detecting the Commonality and Specialty between Scientific Publications and Patents. Scientometrics, Vol. 126, No. 9, pp. 7445-7475.

Requirements

Create Database

Create the database from the SQL file drug_bank.sql.

Import DrugBank

Download drugbank_all_full_database.xml.zip from DrugBank and save it in the directory data. Before downloading, an account needs to be created and approved.

Run Importer.java in the package cn.edu.bjut.ui.

Update PMC id and doi

Download PMC-ids.csv.gz and save it in the directory resource.

Run PmcIdAndDoiUpdater.java in the package cn.edu.bjut.ui.
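PmcIdAndDoiUpdater.java presumably looks up each article in this CSV to fill in missing identifiers. A minimal sketch of that lookup, assuming NCBI's documented columns DOI, PMCID, and PMID (the column names and the helper below are illustrative, not the actual implementation):

```python
import csv
import io

def build_pmcid_index(csv_text):
    """Map PMCID -> (DOI, PMID) from PMC-ids.csv content.

    Column names (DOI, PMCID, PMID) follow NCBI's published layout;
    verify them against the downloaded file before relying on this.
    """
    index = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        index[row["PMCID"]] = (row["DOI"], row["PMID"])
    return index

sample = "DOI,PMCID,PMID\n10.1000/xyz123,PMC1234567,9876543\n"
print(build_pmcid_index(sample)["PMC1234567"])  # ('10.1000/xyz123', '9876543')
```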

Data Augmentation with Medline/PubMed Full Text

Download the Medline/PubMed full text in XML format.

Extract the XML files with ArticleXMLExtractor.java in the package cn.edu.bjut.ui, and save them in the directory data/articles/xml.

Import the related information into the database from the directory data/articles/xml with ArticleXMLImporter.java in the package cn.edu.bjut.ui.

Handle the exceptional cases with SpecialContributorProcessor.java in the package cn.edu.bjut.ui.

Data Augmentation with E-Fetch API

Extract the XML files with ArticleURLExtractor.java in the package cn.edu.bjut.ui, and save them in the directory data/articles/url.

Import the related information into the database from the directory data/articles/url with ArticleURLImporter.java in the package cn.edu.bjut.ui.
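This step can be pictured with NCBI's EFetch endpoint, which returns PubMed records as XML. The sketch below only builds the request URL; whether ArticleURLExtractor.java issues exactly this request is an assumption:

```python
from urllib.parse import urlencode

EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(pubmed_ids):
    """Build an E-utilities EFetch URL that returns PubMed records as XML."""
    params = {
        "db": "pubmed",                                   # PubMed database
        "id": ",".join(str(i) for i in pubmed_ids),       # comma-separated id list
        "retmode": "xml",                                 # XML response
    }
    return EFETCH_BASE + "?" + urlencode(params)

print(efetch_url([9876543, 1234567]))
```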

Supplement Missing or Incorrect Information

SELECT id, pmc_id, citation, title, abst, doi, publication_year FROM article WHERE pubmed_id IS NULL OR title IS NULL; 

Export the above records to articles_missing.xlsx in the directory data, and then correct them manually one by one.

Once the corrections are done, run MissingExcelImporter.java in the package cn.edu.bjut.ui to import the information in the file data/articles_missing.xlsx into the MySQL database.

SELECT id, title FROM article WHERE title LIKE "%author's transl%";  
SELECT id, abst FROM article WHERE abst REGEXP "ABSTRACT TRUNCATED AT [0-9]+ WORDS"; 

Export the above records to articles_updating.xlsx in the directory data, and then correct them manually one by one.

Once the corrections are done, run UpdatingExcelImporter.java in the package cn.edu.bjut.ui to import the information in the file data/articles_updating.xlsx into the MySQL database.

Data Augmentation with OPS API

Extract the XML files with PatentURLExtractor.java in the package cn.edu.bjut.ui, and save them in the directory data/patents/url.

Import the related information into the database from the directory data/patents/url with PatentURLImporter.java in the package cn.edu.bjut.ui.

Update the country information with PatentOriginalCountryUpdater.java in the package cn.edu.bjut.ui.

Export All Documents in the GENIA Format

Run Converter2Genia.java in the package cn.edu.bjut.genia. The articles and patents will be saved in the directories data/genia/articles and data/genia/patents, respectively. Each article or patent document is named with a prefix “S” or “T” followed by the resulting id.

Detect and Tokenize Sentences

> ./run_geniass.sh geniass drugbank/articles &
> ./run_geniass.sh geniass drugbank/patents &
> ./run_geniatagger.sh geniatagger drugbank/articles &
> ./run_geniatagger.sh geniatagger drugbank/patents &

For each document, two files will be generated, with the extensions .txt.ss and .txt.ss.tag. Save all .txt.ss and .txt.ss.tag files in the directory data/genia/articles for scientific publications and data/genia/patents for patent documents.
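The .txt.ss.tag files follow the GENIA tagger's one-token-per-line format: tab-separated word, base form, POS tag, chunk tag, and named-entity tag, with blank lines between sentences. A minimal Python sketch for reading such a file (illustrative, not part of the project):

```python
def parse_genia_tag(text):
    """Parse GENIA tagger output (.txt.ss.tag) into a list of sentences,
    each a list of token dicts with word/base/pos/chunk/ne fields."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        word, base, pos, chunk, ne = line.split("\t")
        current.append({"word": word, "base": base, "pos": pos,
                        "chunk": chunk, "ne": ne})
    if current:
        sentences.append(current)
    return sentences

sample = "Aspirin\tAspirin\tNN\tB-NP\tO\ninhibits\tinhibit\tVBZ\tB-VP\tO\n"
print(parse_genia_tag(sample)[0][1]["base"])  # inhibit
```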

Format the Tokenized Content for the HMM-LDA Model

Run Converter2HmmLda.java in the package cn.edu.bjut.genia. The results will be saved in the directory data/hmm-lda, which contains four types of files with the extensions .corpus, .docs, .vocab, and .corpus.tokens. The first three are inputs to the HMM-LDA model; the last is used for the indicator calculation.

Estimate an HMM-LDA Model

Run HMMLDA.java in the package cn.edu.bjut.ui from the separate project HMM-LDA. If desired, the parameters can be set through the configuration file HMMLDA.properties, located in the directory conf.

Syntactic and Lexical Complexity before Filtering Stopwords

Run synatic complexity_Patent.py and synatic complexity_Article.py in the directory indicators/before to calculate the syntactic complexity indicators (title length, abstract length, and average sentence length of the abstract);

Run Abs_Sen_Complexity_Patent.py and Abs_Sen_Complexity_Article.py in the directory indicators/before to save the parse-tree structures, and then calculate sentence complexity with the tool stanford-tregex;

Run lexical complexity_Patent_Title.py, lexical complexity_Patent_Abs.py, lexical complexity_Article_Title.py, and lexical complexity_Article_Abs.py in the directory indicators/before to calculate the lexical complexity indicators (lexical diversity, sophistication, and density);
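As a reference point for what these scripts compute: lexical diversity is commonly measured as the type-token ratio, and lexical density as the share of content words among all words. A minimal sketch under these standard definitions (the exact formulas in the scripts may differ):

```python
def type_token_ratio(tokens):
    """Lexical diversity: unique words / total words."""
    return len(set(tokens)) / len(tokens)

def lexical_density(pos_tags):
    """Lexical density: content words (nouns, verbs, adjectives, adverbs)
    over all words, judged by Penn Treebank POS-tag prefixes."""
    content_prefixes = ("NN", "VB", "JJ", "RB")
    content = sum(1 for tag in pos_tags if tag.startswith(content_prefixes))
    return content / len(pos_tags)

tokens = ["the", "drug", "inhibits", "the", "enzyme"]
tags = ["DT", "NN", "VBZ", "DT", "NN"]
print(type_token_ratio(tokens))  # 0.8
print(lexical_density(tags))     # 0.6
```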

Syntactic and Lexical Complexity after Filtering Stopwords

Run Mean_synatic_complexity_Patent.py and Mean_synatic_complexity_Article.py in the directory indicators/after to calculate the syntactic complexity indicators on meaningful words (title length, abstract length, and average sentence length of the abstract);

Run Mean_lexical_complexity_Patent_Title.py, Mean_lexical complexity_Patent_Abs.py, Mean_lexical complexity_Article_Title.py, and Mean_lexical_complexity_Article_Abs.py in the directory indicators/after to calculate the lexical complexity indicators on meaningful words (lexical diversity, sophistication, and density);

Descriptive Statistics and Word Cloud

Run Statistics.py and overlap.py in the directory indicators/before to count the numbers of (overlapping) tokens and (overlapping) unique words;

Run Mean_Statistics.py, Mean_overlap.py, and Non_overlap.py in the directory indicators/after to count the numbers of (overlapping) tokens and (overlapping) unique words, and to save the overlapping words with their corresponding word frequencies.
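Conceptually, the overlap statistics reduce to set operations on the two vocabularies. A small illustrative sketch (not the actual scripts):

```python
def vocabulary_overlap(article_tokens, patent_tokens):
    """Count tokens and unique words on each side, plus the shared vocabulary."""
    a_vocab, p_vocab = set(article_tokens), set(patent_tokens)
    shared = a_vocab & p_vocab
    return {
        "article_tokens": len(article_tokens),
        "patent_tokens": len(patent_tokens),
        "article_vocab": len(a_vocab),
        "patent_vocab": len(p_vocab),
        "overlap_vocab": len(shared),
        "overlap_words": sorted(shared),
    }

stats = vocabulary_overlap(["drug", "target", "binding"],
                           ["drug", "compound", "binding"])
print(stats["overlap_words"])  # ['binding', 'drug']
```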

Format Data for the CDTM Model

Run Trans_CDTM.py in the directory CDTM-Test. At this point, a dictionary will be generated in files with the extensions .word.vocab and ID.csv. Then, with the help of Excel, generate two files with the extensions .docs and .corpus.

Estimate a CDTM Model

Run CdtmParameterTuning.java in the package cn.edu.bjut.ui. The perplexity will be obtained for each candidate combination of the number of common topics, the number of topics specific to scientific publications, and the number of topics specific to patents.

Then, import the perplexity values into MATLAB and run TuneParam.m. A figure will be shown plotting the perplexity against different numbers of topics. From this figure, the optimal numbers of common and special topics can be determined. Finally, run Cdtm.java in the package cn.edu.bjut.ui to obtain the final results.
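Reading the optimum off the MATLAB figure amounts to picking the combination with the lowest perplexity. The same selection done programmatically, as a sketch (the grid values below are made up for illustration):

```python
def best_topic_combination(perplexities):
    """Given {(common, article_specific, patent_specific): perplexity},
    return the combination with the minimum perplexity."""
    return min(perplexities, key=perplexities.get)

# Hypothetical perplexity values for three candidate combinations.
grid = {
    (10, 5, 5): 1520.3,
    (20, 5, 5): 1475.8,
    (20, 10, 10): 1498.1,
}
print(best_topic_combination(grid))  # (20, 5, 5)
```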

Connections amongst Common and Special Topics

Run NetworkConverter.java in the package cn.edu.bjut.ui. One map file and one network file will be generated; these two files can then be imported into the software VOSviewer.

zh/notes/common_specialty.txt · Last modified: 2022/11/08 07:33 by pzczxs