The database SQL file: electric_power.sql.
Project: WoSImporter.
One can preprocess all files in BibTeX format by running BibTeXPreprocessor.java in the package cn.edu.bjut.ui.
> nohup ./preprocess-wos.sh ../dataset/WoS/papers > preprocess-wos.log 2>&1
The target articles can be imported into the database by running ArticleBibTexImporter.java in the package cn.edu.bjut.ui. Note that the parameters checkFlag and citedArticleFlag should both be set to false. In this procedure, the DOI names of cited articles are pre-processed with the cleaning method of Xu et al. (2019). After that, an exploratory analysis of data quality can be conducted, especially for the DOIs of articles and cited articles. Any incorrect information found can be corrected manually. For example, one can update the resulting DOI names of two articles by running the following statements:
>
Update publication year:
> SELECT id, wos_id, title, doi, publication_year FROM article WHERE publication_year = 0 ORDER BY doi ASC;
> nohup ./import-wos.sh 0 ../dataset/WoS/papers > import-wos.log 2>&1
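The DOI cleaning applied during import is described in Xu et al. (2019) and is not reproduced here. As a rough illustration only, the kind of normalization such a step typically involves can be sketched as follows. The function name clean_doi and every rule in it (prefix stripping, trailing punctuation removal, lowercasing, shape check) are assumptions made for this sketch, not the project's actual rules.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Illustrative assumptions: resolver/label prefixes and a loose DOI shape.
_PREFIXES = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)
_DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def clean_doi(raw: str) -> Optional[str]:
    """Return a normalized DOI name, or None if the input does not look like one."""
    doi = unquote(raw).strip()      # decode %2F and similar escapes
    doi = _PREFIXES.sub("", doi)    # drop "https://doi.org/" or "doi:" prefixes
    doi = doi.rstrip(".,;")         # drop trailing punctuation left by citation text
    doi = doi.lower()               # DOI names are case-insensitive
    return doi if _DOI_SHAPE.match(doi) else None
```

Values for which clean_doi returns None are natural candidates for the manual check described above.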
The times-cited counts can be imported by running CitedTimesImporter.java in the package cn.edu.bjut.ui.
> nohup ./import-article-cited-times.sh > import-article-cited-times.log 2>&1
The target articles can be merged by running ArticleMerger.java in the package cn.edu.bjut.ui according to the resulting DOI names.
> nohup ./merge-article.sh > merge-article.log 2>&1
The authors can be merged by running AuthorMerger.java in the package cn.edu.bjut.ui according to ResearcherID, ORCID and Email.
> nohup ./merge-author.sh > merge-author.log 2>&1
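Merging on several identifiers at once is transitive: two author records may share no identifier directly but still be linked through a third record. A minimal sketch of this idea with a union-find structure is given below. The record layout (dictionaries with the keys id, researcher_id, orcid and email) and the choice of the smallest id as the surviving record are assumptions for illustration; AuthorMerger.java may work differently.

```python
def merge_authors(authors):
    """Map each author id to the smallest id among records sharing an identifier.

    authors: list of dicts with 'id' and optional 'researcher_id', 'orcid', 'email'
    (a hypothetical layout used only for this sketch).
    """
    parent = {a["id"]: a["id"] for a in authors}

    def find(x):
        # Follow parent links to the group root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as root so the root is always the group minimum.
            parent[max(ra, rb)] = min(ra, rb)

    first_owner = {}  # (field, normalized value) -> first author id seen with it
    for a in authors:
        for field in ("researcher_id", "orcid", "email"):
            value = a.get(field)
            if not value:
                continue
            key = (field, value.strip().lower())
            if key in first_owner:
                union(first_owner[key], a["id"])
            else:
                first_owner[key] = a["id"]
    return {a["id"]: find(a["id"]) for a in authors}
```

In the test data below, records 1 and 2 share an ORCID and records 2 and 3 share an email address, so all three map to id 1.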
The journals can be merged by running JournalMerger.java in the package cn.edu.bjut.ui according to ISSN, EISSN and ISBN.
> nohup ./merge-journal.sh > merge-journal.log 2>&1
The funding records can be merged by running FundingMerger.java in the package cn.edu.bjut.ui according to grant number.
> nohup ./merge-funding.sh > merge-funding.log 2>&1
KeywordUpdater.java
> nohup ./import-wos-keyword.sh 0 ../dataset/WoS/papers > import-wos-keyword.log 2>&1
> SELECT id, doi FROM cited_article WHERE doi LIKE "%PUBMED%" AND flag = 0 AND journal_id IS NULL INTO OUTFILE "/var/lib/mysql-files/doi-errors.txt";
The cited articles with multiple DOI names can be resolved by running CitedArticleMultipleDoiResolver.java in the package cn.edu.bjut.doi. Note that this operation needs to access the DOI parser.
> nohup ./resolve-cited-article-with-multiple-dois.sh > resolve-cited-article-with-multiple-dois.log 2>&1
The cited articles whose DOI names mix non-preprint, preprint and dataset DOIs can be split by running CitedArticleMultipleDoiSplitter.java in the package cn.edu.bjut.doi.
> nohup ./split-cited-article-with-multiple-dois.sh > split-cited-article-with-multiple-dois.log 2>&1
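One simple way to tell the three kinds of DOI names apart is to look at the registrant prefix (the part before the first slash). The sketch below is only an illustration under that assumption. The prefix tables are deliberately short examples (arXiv, SSRN, Zenodo, Dryad), and the criteria actually used by the splitter may differ, for instance by querying the DOI parser.

```python
# Illustrative, incomplete prefix tables (assumptions for this sketch).
PREPRINT_PREFIXES = {"10.48550", "10.2139"}  # arXiv, SSRN
DATASET_PREFIXES = {"10.5281", "10.5061"}    # Zenodo, Dryad


def classify_doi(doi: str) -> str:
    """Return 'preprint', 'dataset' or 'non-preprint' based on the DOI prefix."""
    prefix = doi.split("/", 1)[0].lower()
    if prefix in PREPRINT_PREFIXES:
        return "preprint"
    if prefix in DATASET_PREFIXES:
        return "dataset"
    return "non-preprint"


def split_dois(dois):
    """Group the DOI names of one cited article by kind."""
    groups = {"non-preprint": [], "preprint": [], "dataset": []}
    for doi in dois:
        groups[classify_doi(doi)].append(doi)
    return groups
```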
The cited articles can be merged by running CitedArticleMerger.java in the package cn.edu.bjut.ui according to the resulting DOI names.
> nohup ./merge-cited-article.sh > merge-cited-article.log 2>&1
The information related to the cited articles can be updated from the resulting target ones by running CitedArticleUpdaterWithArticle.java in the package cn.edu.bjut.ui. The updated information includes title, abstract, publication year, type, journal, keyword, keyword plus, category, research area and so on.
> nohup ./update-cited-article-with-article.sh > update-cited-article-with-article.log 2>&1
The DOI names of the cited articles are randomly divided into six groups by running CitedArticleDoiExtractor.java in the package cn.edu.bjut.doi. This generates six files (cited-articles-$i$.doi, $i \in \{1, 2, \cdots, 6\}$) in the directory data/doi_group.
> nohup ./group-cited-article-doi.sh 807364 4 ../dataset/doi/20231231/cited-articles- > doi-grouper.log 2>&1
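The random division itself amounts to a shuffle followed by a round-robin split. A minimal sketch is shown below. The function names, the seed parameter and the one-DOI-per-line file layout are assumptions for illustration, not the behaviour of CitedArticleDoiExtractor.java.

```python
import random


def group_randomly(items, n_groups, seed=None):
    """Shuffle the items and deal them into n_groups lists of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Slice i takes every n_groups-th element, so group sizes differ by at most one.
    return [shuffled[i::n_groups] for i in range(n_groups)]


def write_groups(groups, prefix, suffix=".doi"):
    """Write group i to '<prefix><i><suffix>', one item per line, numbering from 1."""
    for i, group in enumerate(groups, start=1):
        with open(f"{prefix}{i}{suffix}", "w", encoding="utf-8") as handle:
            handle.write("\n".join(group) + "\n")
```

Passing a fixed seed makes the grouping reproducible, which helps when a failed download batch has to be rerun.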
The cited articles can be imported by running ArticleBibTexImporter.java in the package cn.edu.bjut.ui. It is noteworthy that the parameters checkFlag and citedArticleFlag are set to false and true respectively.
> nohup ./import-wos.sh 1 ../dataset/WoS/cited_papers > import-wos.log 2>&1
The journals can be merged by running JournalMerger.java in the package cn.edu.bjut.ui according to ISSN, EISSN and ISBN.
> SELECT code, name, name_cn FROM country WHERE name IS NOT NULL ORDER BY code ASC;
An anonymous author usually appears in the author table. In our case, it is named “[Anonymous]” with id = 7982. One can remove the relations between this author and the resulting articles by running the following SQL statement.
> DELETE FROM article_author WHERE author_id = 7982;
In addition, “Lars Stemmann Fabien Lombard” (id = 1480925) actually represents two different authors, “Stemmann, Lars” (id = 623081) and “Lombard, Fabien” (id = 608135). One can correct this by running the following SQL statements.
> DELETE FROM article_author WHERE article_id = 628802 AND author_id = 1480925;
> INSERT article_author (article_id, author_id, seq_no, is_reprint) VALUES (628802, 623081, 20, 0);
> INSERT article_author (article_id, author_id, seq_no, is_reprint) VALUES (628802, 608135, 21, 0);
> SELECT id, full_name, first_name, last_name FROM author WHERE last_name IS NULL ORDER BY full_name ASC;
> SELECT id, title, doi, publication_year FROM article WHERE id IN (SELECT DISTINCT article_id FROM article_author WHERE author_id = ?);
Many of the cited references are patents. One can retrieve them with the following SQL statement, and then check them one by one.
> SELECT id, preferred_id, text FROM cited_article WHERE text LIKE "%patent%" INTO OUTFILE "/var/lib/mysql-files/cited_patents_from_articles.csv";
One can run ArticleTechnologyUpdater.java in the package cn.edu.bjut.ui to update the technologies.
> nohup ./update-article-technology.sh 0 > update-article-technology.log 2>&1
Project: DerwentImporter.
The target patents can be imported to the database by running PatentImporter.java in the package cn.edu.bjut.ui. It is noteworthy that the parameter flag should be set to true. Then, one can run Preprocessor.java in the package cn.edu.bjut.ui to pre-process the resulting abstracts.
> nohup ./import-derwent.sh > import-derwent.log 2>&1
> nohup ./preprocess-derwent.sh > preprocess-derwent.log 2>&1
The cited patent ids can be updated by running PatentCitedPatentUpdater.java in the package cn.edu.bjut.ui, and the log is saved in the file data/patent_cited_patent.log. Then, one can run PatentCitedPatentUpdaterByLog.java to import the related information from the saved log file.
> nohup ./update-patent-cited-patent.sh > patent-cited-patent.log 2>&1
> nohup ./update-patent-cited-patent-by-log.sh > update-patent-cited-patent-by-log.log 2>&1
The patent numbers of the cited patents are randomly divided into two groups by running CitedPatentNoGrouper.java in the package cn.edu.bjut.ui. This generates two files (cited-patents-$i$.txt, $i \in \{1, 2\}$) in the directory data/patent_no_group.
> nohup ./group-cited-patent-no.sh 4 ../dataset/patent_no/20240327/cited-patent- > group-cited-patent-no.log 2>&1
The target patents can be imported to the database by running PatentImporter.java in the package cn.edu.bjut.ui. It is noteworthy that the parameter flag should be set to false. Then, one can run Preprocessor.java in the package cn.edu.bjut.ui to pre-process the resulting abstracts.
> nohup ./import-derwent.sh > derwent-import.log 2>&1
> nohup ./preprocess-derwent.sh > preprocessor-derwent-cited.log 2>&1
To speed things up, the publication year of each patent can be updated by running PublicationYearUpdater.java in the package cn.edu.bjut.ui.
> nohup ./update-patent-publication-year.sh > update-patent-publication-year.log 2>&1
The relations between patent and country can be updated by running PatentCountryUpdater.java in the package cn.edu.bjut.ui.
> nohup ./update-patent-country.sh > update-patent-country.log 2>&1
The PCT patents can be determined by running PctFlagUpdater.java in the package cn.edu.bjut.ui.
> nohup ./update-pct-flag.sh > update-pct-flag.log 2>&1
One can run PatentTechnologyUpdater.java in the package cn.edu.bjut.ui to update the technologies.
> nohup ./update-patent-technology.sh 2 ../dataset/Derwent/technologies > update-patent-technology.log 2>&1
Project: ElectricPowerConvertor.
The keywords with brackets should be extracted with the following SQL statement, and the resulting abbreviations are saved into the file abbreviations.xlsx.
> SELECT id, name FROM keyword WHERE id > 784519 AND (name LIKE "%(%" OR name LIKE "%)%") INTO OUTFILE "/var/lib/mysql-files/keyword2020721.csv";
The target articles and patents can be exported by running ToTextConvertor.java in the package cn.edu.bjut.converter. Then, one can extract the abbreviations from the titles and abstracts with the approach in Schwartz and Hearst (2003) by running AbbreviationExtractor.java in the package cn.edu.bjut.ui.
> nohup ./convert-wos-text.sh > convert-wos-text.log 2>&1
> nohup ./convert-derwent-text.sh > convert-derwent-text.log 2>&1
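The core of the Schwartz and Hearst (2003) approach is a right-to-left character match between a short form in parentheses and the words preceding it. The following is a simplified sketch of that idea, not the code of AbbreviationExtractor.java. It handles only the "long form (short form)" pattern, omits the additional validity filters of the original algorithm, and the function names are chosen for this example.

```python
import re
from typing import Dict, Optional


def best_long_form(short: str, long: str) -> Optional[str]:
    """Match the short form's characters right to left inside the candidate text."""
    s, l = len(short) - 1, len(long) - 1
    while s >= 0:
        c = short[s].lower()
        if not c.isalnum():          # skip hyphens and other punctuation
            s -= 1
            continue
        # Move left until the character matches; the first character of the
        # short form must additionally start a word in the long form.
        while (l >= 0 and long[l].lower() != c) or (
            s == 0 and l > 0 and long[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    start = long.rfind(" ", 0, l + 1) + 1
    return long[start:]


def extract_abbreviations(text: str) -> Dict[str, str]:
    """Return {short form: long form} for 'long form (SF)' patterns in the text."""
    pairs = {}
    for match in re.finditer(r"\(([^()]{2,10})\)", text):
        short = match.group(1).strip()
        if (len(short) < 2 or not short[0].isalnum()
                or not any(ch.isalpha() for ch in short)
                or len(short.split()) > 2):
            continue
        # Search window: at most min(|SF| + 5, 2 * |SF|) words before the bracket.
        max_words = min(len(short) + 5, len(short) * 2)
        candidate = " ".join(text[:match.start()].split()[-max_words:])
        long_form = best_long_form(short, candidate)
        if long_form:
            pairs[short] = long_form
    return pairs
```

For example, the sentence "We use the Topic N-Gram (TNG) model." yields the pair TNG / Topic N-Gram.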
The target articles/patents can be exported for the TNG (Topic N-Gram) model by running ToTNGConvertor.java in the package cn.edu.bjut.converter.
> nohup ./convert-wos-bioc.sh > convert-wos-bioc.log 2>&1
> nohup ./convert-derwent-bioc.sh > convert-derwent-bioc.log 2>&1
> nohup ./convert-tng.sh > convert-tng.log 2>&1
The journal of each cited article and the IPC codes of each cited patent can be exported by running NoveltyConvertor.java in the package cn.edu.bjut.runner. The novelty indicator of each document can then be calculated following Uzzi et al. (2013).
> nohup ./convert-wos-novelty.sh ../dataset/WoS/novelty/20240421/paper-citations.txt > convert-wos-novelty.log 2>&1
> nohup ./convert-derwent-novelty.sh ../dataset/Derwent/novelty/20240421/patent-citations.txt > convert-derwent-novelty.log 2>&1
> TRUNCATE article_novelty;
> TRUNCATE patent_novelty;
> nohup ./import-article-novelty.sh ../dataset/WoS/novelty/20240421/paper_novelty.txt > import-article-novelty.log 2>&1
> nohup ./import-patent-novelty.sh ../dataset/Derwent/novelty/20240421/patent_novelty.txt > import-patent-novelty.log 2>&1
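In Uzzi et al. (2013), each pair of co-cited journals receives a z-score that compares its observed frequency with its frequency in randomized citation networks. A document is then characterized by the 10th percentile (novelty) and the median (conventionality) of the z-scores of its reference pairs. The sketch below illustrates only this scoring step under simplifying assumptions: the randomized pair counts are taken as given input, the population standard deviation is used, and the 10th percentile is approximated by a simple rank rule. For patents, IPC codes would take the place of journal names. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, median, pstdev


def pair_counts(reference_journals):
    """Count unordered journal pairs over all reference lists.

    reference_journals: dict mapping a document id to the journals it cites.
    """
    counts = Counter()
    for journals in reference_journals.values():
        counts.update(combinations(sorted(journals), 2))
    return counts


def z_scores(observed, simulated):
    """z = (observed - mean of simulated) / std of simulated, per journal pair.

    simulated: list of Counters, one per randomized network (assumed given).
    Pairs with zero variance across simulations are skipped.
    """
    scores = {}
    for pair, obs in observed.items():
        sims = [s.get(pair, 0) for s in simulated]
        sd = pstdev(sims)
        if sd > 0:
            scores[pair] = (obs - mean(sims)) / sd
    return scores


def document_indicators(journals, scores):
    """Return the low-percentile and median z-score of one document's pairs."""
    values = sorted(scores[p] for p in combinations(sorted(journals), 2) if p in scores)
    if not values:
        return None
    tenth = values[int(0.1 * (len(values) - 1))]  # simple rank approximation
    return {"novelty": tenth, "conventionality": median(values)}
```

A negative novelty value indicates that the document contains journal pairs that occur less often than expected by chance.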
The citation network for articles can be extracted by running WoSCitationNetworkExtractor.java in the package cn.edu.bjut.converter.
The citation network for patents can be extracted by running DerwentCitationNetworkExtractor.java in the package cn.edu.bjut.converter.
The citation network for articles and patents can be extracted by running BothCitationNetworkExtractor.java in the package cn.edu.bjut.converter.
Project: TopicalNGramsModel.
The term-based topics (Xu et al., 2021) can be discovered by running TopicalNGrams.java in the package cn.edu.bjut.ui. The top n-grams of each topic are then re-ranked by Term Frequency-Inverse Document Frequency (TF-IDF) (Hisamitsu et al., 1999; Meyers et al., 2018), Document Relevance Document Consensus (DRDC) (Navigli and Velardi, 2004; Meyers et al., 2018), and Kullback-Leibler Divergence (KLD) (Meyers et al., 2018).
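As an illustration of the re-ranking idea, the sketch below orders the n-grams of one topic by a textbook TF-IDF score, tf × log(N / df). The exact weighting used in the cited works and in TopicalNGrams.java may differ. The function name and the input layout (each document given as a list of n-gram strings) are assumptions for this example.

```python
import math
from collections import Counter


def tfidf_rerank(topic_ngrams, documents):
    """Sort a topic's n-grams by corpus-level TF-IDF, highest score first.

    topic_ngrams: list of n-gram strings belonging to one topic.
    documents: list of documents, each a list of n-gram strings.
    """
    n_docs = len(documents)
    tf, df = Counter(), Counter()
    for doc in documents:
        tf.update(doc)        # total occurrences over the corpus
        df.update(set(doc))   # number of documents containing the n-gram

    def score(ngram):
        if df[ngram] == 0:
            return 0.0
        return tf[ngram] * math.log(n_docs / df[ngram])

    return sorted(topic_ngrams, key=score, reverse=True)
```

An n-gram that occurs in every document gets a score of zero and therefore moves to the end of the list, which is the intended effect of penalizing uninformative terms.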
Project: TechEmergenceIndicators.
One can run IndicatorCalculator.java in the package cn.edu.bjut.ui to calculate all indicators (Xu et al., 2021).
> .\opennlp TokenNameFinderTrainer.brat -nameTypes ATTRIBUTE,VALUE -lang en -model en-ElectronicPower-WoS-attributes.bin -annotationConfig annotation.conf -bratDataDir ElectronicPower-WoS-Train -ruleBasedTokenizer simple -sentenceDetectorModel en-sent.bin