Data Construction for Electric Power Domain

Requirements

Data Sources

Web of Science: Scientific Publications in English
PatStat: Patents in English

Create Database

The database SQL file: electric_power.sql.

Web of Science

Project: WoSImporter.

Download Target Articles

Time Span: From 2015-01-01 to 2023-12-31
# of students: 5
Export Format: BibTeX
Record Content: Full Record and Cited References
Search Strategy:

Preprocess Target Articles

One can preprocess all files in BibTeX format by running BibTeXPreprocessor.java in the package cn.edu.bjut.ui.

> nohup ./preprocess-wos.sh ../dataset/WoS/papers > preprocess-wos.log 2>&1

Import Target Articles

The target articles can be imported into the database by running ArticleBibTexImporter.java in the package cn.edu.bjut.ui. Note that the parameters checkFlag and citedArticleFlag should both be set to false. During this procedure, the DOI names of cited articles are pre-processed with the cleaning method in Xu et al. (2019). Afterwards, an exploratory analysis of data quality can be conducted, especially for the DOIs of articles and cited articles. If incorrect information is found, it can be corrected manually. For example, one can update the resulting DOI names of two articles by running the following statements:

Update publication year:

> SELECT id, wos_id, title, doi, publication_year FROM article WHERE publication_year = 0 ORDER BY doi ASC;

> nohup ./import-wos.sh 0 ../dataset/WoS/papers > import-wos.log 2>&1
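The kind of DOI clean-up mentioned above can be illustrated with a minimal sketch. It is not the cleaning method of Xu et al. (2019); the class name DoiNormalizer, the prefix list and the validity pattern are assumptions made only for illustration.

```java
import java.util.Locale;

/** Illustrative DOI normalizer; NOT the cleaning method of Xu et al. (2019). */
final class DoiNormalizer {

    // Resolver prefixes that often precede a DOI name (assumed list, extend as needed).
    private static final String[] PREFIXES = {
        "https://doi.org/", "http://doi.org/",
        "https://dx.doi.org/", "http://dx.doi.org/", "doi:"
    };

    /** Returns a lower-cased DOI name without prefix, or null if the input does not look like a DOI. */
    static String normalize(String raw) {
        if (raw == null) {
            return null;
        }
        // DOI names are case-insensitive, so lower-casing makes them comparable.
        String doi = raw.trim().toLowerCase(Locale.ROOT);
        for (String prefix : PREFIXES) {
            if (doi.startsWith(prefix)) {
                doi = doi.substring(prefix.length()).trim();
                break;
            }
        }
        // Drop trailing punctuation that tends to stick to DOIs in reference strings.
        doi = doi.replaceAll("[.,;\\s]+$", "");
        // Assumed shape: "10." + 4-9 digit registrant code + "/" + non-blank suffix.
        return doi.matches("10\\.\\d{4,9}/\\S+") ? doi : null;
    }

    public static void main(String[] args) {
        System.out.println(normalize(" https://doi.org/10.1016/J.ENERGY.2019.01.001. "));
    }
}
```

Values for which normalize returns null are candidates for the manual inspection described in this section.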

Import Cited Times

The cited times can be imported by running CitedTimesImporter.java in the package cn.edu.bjut.ui.

> nohup ./import-article-cited-times.sh > import-article-cited-times.log 2>&1

Merge Target Articles, Authors, Journals and Fundings

The target articles can be merged by running ArticleMerger.java in the package cn.edu.bjut.ui according to the resulting DOI names.

> nohup ./merge-article.sh > merge-article.log 2>&1

The authors can be merged by running AuthorMerger.java in the package cn.edu.bjut.ui according to ResearcherID, ORCID and Email.

> nohup ./merge-author.sh > merge-author.log 2>&1

The journals can be merged by running JournalMerger.java in the package cn.edu.bjut.ui according to ISSN, EISSN and ISBN.

> nohup ./merge-journal.sh > merge-journal.log 2>&1

The fundings can be merged by running FundingMerger.java in the package cn.edu.bjut.ui according to grant number.

> nohup ./merge-funding.sh > merge-funding.log 2>&1
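Merging records that share any one of several identifiers, as with ResearcherID, ORCID and Email for authors, is a connected-components problem. The sketch below shows one possible approach using a union-find structure. It is not the code of AuthorMerger.java; the class AuthorGrouper and the convention of prefixing each key with its identifier type (for example "orcid:") are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative grouping of author rows that share at least one identifier. */
final class AuthorGrouper {
    private final Map<Integer, Integer> parent = new HashMap<>();
    private final Map<String, Integer> firstOwner = new HashMap<>();

    /** Returns the representative (smallest) author id of the group containing id. */
    int find(int id) {
        int p = parent.computeIfAbsent(id, k -> k);
        if (p == id) {
            return id;
        }
        int root = find(p);
        parent.put(id, root); // path compression
        return root;
    }

    /** Joins two groups; the smaller root id stays the representative. */
    void union(int a, int b) {
        int ra = find(a);
        int rb = find(b);
        if (ra != rb) {
            parent.put(Math.max(ra, rb), Math.min(ra, rb));
        }
    }

    /** Registers one typed identifier, e.g. "orcid:..." or "email:...", for an author row. */
    void addIdentifier(int authorId, String key) {
        find(authorId); // make the author known even if the key is empty
        if (key == null || key.isEmpty()) {
            return;
        }
        Integer owner = firstOwner.putIfAbsent(key, authorId);
        if (owner != null) {
            union(owner, authorId);
        }
    }
}
```

After all identifiers are registered, every author row whose find result differs from its own id could be redirected to that representative.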

Update Keyword

The keywords can be updated by running KeywordUpdater.java.

> nohup ./import-wos-keyword.sh ../dataset/WoS/papers > import-wos-keyword.log 2>&1

Merge Cited Articles

> SELECT id, doi FROM cited_article WHERE doi LIKE "%PUBMED%" AND flag = 0 AND journal_id IS NULL INTO OUTFILE "/var/lib/mysql-files/doi-errors.txt";

The cited articles with multiple DOI names can be resolved by running CitedArticleMultipleDoiResolver.java in the package cn.edu.bjut.doi. Note that this operation needs to access the DOI parser.

> nohup ./resolve-cited-article-with-multiple-dois.sh > resolve-cited-article-with-multiple-dois.log 2>&1

Cited articles that carry DOI names of different kinds (non-preprint, preprint and dataset) can be split by running CitedArticleMultipleDoiSplitter.java in the package cn.edu.bjut.doi.

> nohup ./split-cited-article-with-multiple-dois.sh > split-cited-article-with-multiple-dois.log 2>&1

The cited articles can be merged by running CitedArticleMerger.java in the package cn.edu.bjut.ui according to the resulting DOI names.

> nohup ./merge-cited-article.sh > merge-cited-article.log 2>&1

Update Cited Articles with Target Ones

The information related to the cited articles can be updated from the resulting target ones by running CitedArticleUpdaterWithArticle.java in the package cn.edu.bjut.ui. The updated information includes title, abstract, publication year, type, journal, keyword, keyword plus, category, research area and so on.

> nohup ./update-cited-article-with-article.sh > update-cited-article-with-article.log 2>&1

Download Cited Articles

The DOI names for cited articles are randomly divided into six groups by running CitedArticleDoiExtractor.java in the package cn.edu.bjut.doi. As a result, six files (cited-articles-$i$.doi, $i \in \{1, 2, \cdots, 6\}$) are generated in the directory data/doi_group.

> nohup ./group-cited-article-doi.sh 807364 4 ../dataset/doi/20231231/cited-articles- > doi-grouper.log 2>&1
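A random division into groups of nearly equal size can be sketched as a shuffle followed by round-robin assignment. This is only an illustration of the idea, not the code of CitedArticleDoiExtractor.java; the class RandomGrouper and its seed parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative random split of a list into a fixed number of groups. */
final class RandomGrouper {

    /** Shuffles a copy of items with the given seed and deals them out round-robin. */
    static <T> List<List<T>> split(List<T> items, int groups, long seed) {
        if (groups <= 0) {
            throw new IllegalArgumentException("groups must be positive");
        }
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps the split reproducible
        List<List<T>> result = new ArrayList<>();
        for (int g = 0; g < groups; g++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < shuffled.size(); i++) {
            result.get(i % groups).add(shuffled.get(i));
        }
        return result;
    }
}
```

Group sizes differ by at most one, and each group could then be written to its own file.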

Import Cited Articles

The cited articles can be imported by running ArticleBibTexImporter.java in the package cn.edu.bjut.ui. Note that the parameters checkFlag and citedArticleFlag are set to false and true, respectively.

> nohup ./import-wos.sh 1 ../dataset/WoS/cited_papers > import-wos.log 2>&1

Merge Journals

The journals can be merged by running JournalMerger.java in the package cn.edu.bjut.ui according to ISSN, EISSN and ISBN.

> nohup ./merge-journal.sh > merge-journal.log 2>&1

Update Country

> SELECT code, name, name_cn FROM country WHERE name IS NOT NULL ORDER BY code ASC;

Update Author

An anonymous author usually appears in the author table. In our case, it is named “[Anonymous]” and has id = 7982. One can remove the relations between this author and the resulting articles by running the following SQL statement.

> DELETE FROM article_author WHERE author_id = 7982;

In addition, “Lars Stemmann Fabien Lombard” (id = 1480925) actually represents two different authors, “Stemmann, Lars” (id = 623081) and “Lombard, Fabien” (id = 608135). One can correct this by running the following SQL statements.

> DELETE FROM article_author WHERE article_id = 628802 AND author_id = 1480925;
> INSERT INTO article_author (article_id, author_id, seq_no, is_reprint) VALUES (628802, 623081, 20, 0);
> INSERT INTO article_author (article_id, author_id, seq_no, is_reprint) VALUES (628802, 608135, 21, 0);

> SELECT id, full_name, first_name, last_name FROM author WHERE last_name IS NULL ORDER BY full_name ASC;

> SELECT id, title, doi, publication_year FROM article WHERE id IN (SELECT DISTINCT article_id FROM article_author WHERE author_id = ?);

Cited Patents

Many of the references in the cited articles are patents. One can retrieve them with the following SQL statement and then check them one by one.

> SELECT id, preferred_id, text FROM cited_article WHERE text LIKE "%patent%" INTO OUTFILE "/var/lib/mysql-files/cited_patents_from_articles.csv";

Update Technologies

One can run ArticleTechnologyUpdater.java in the package cn.edu.bjut.ui to update the technologies.

> nohup ./update-article-technology.sh 0 > update-article-technology.log 2>&1

Derwent Innovation Index

Project: DerwentImporter.

Search Strategy

Time Span: From 2015-01-01 to 2023-12-31
IPC Codes: “H02B*” OR “H02G*” OR “H02H*” OR “H02J*” OR “H02K*” OR “H02M*” OR “H02N*” OR “H02P*” OR “H02S*”
Relevant Terms: “CO2 EMISSION” OR “CO2 EMISSIONS” OR “CARBON EMISSION” OR “CARBON EMISSIONS” OR “CARBON DIOXIDE EMISSION” OR “CARBON DIOXIDE EMISSIONS” OR “CARBON NEUTRAL” OR “CARBON NEUTRALITY” OR “CARBON PEAK” OR “CARBON PEAKING” OR “CARBON MITIGATION” OR “CO2 MITIGATION” OR “LOW-CARBON” OR “LOW CARBON” OR “DECARBONIZED POWER SYSTEMS”

Import Target Patents

The target patents can be imported into the database by running PatentImporter.java in the package cn.edu.bjut.ui. Note that the parameter flag should be set to true. Then, one can run Preprocessor.java in the package cn.edu.bjut.ui to pre-process the resulting abstracts.

> nohup ./import-derwent.sh > import-derwent.log 2>&1
> nohup ./preprocess-derwent.sh > preprocess-derwent.log 2>&1

Update Cited Patents with Target Ones

The cited patent ids can be updated by running PatentCitedPatentUpdater.java in the package cn.edu.bjut.ui, and the log is saved in the file data/patent_cited_patent.log. Then, one can run PatentCitedPatentUpdaterByLog.java to import the related information from the saved log file.

> nohup ./update-patent-cited-patent.sh > patent-cited-patent.log 2>&1
> nohup ./update-patent-cited-patent-by-log.sh > update-patent-cited-patent-by-log.log 2>&1

Download Cited Patents

The patent numbers for cited patents are randomly divided into two groups by running CitedPatentNoGrouper.java in the package cn.edu.bjut.ui. As a result, two files (cited-patents-$i$.txt, $i \in \{1, 2\}$) are generated in the directory data/patent_no_group.

> nohup ./group-cited-patent-no.sh 4 ../dataset/patent_no/20240327/cited-patent- > group-cited-patent-no.log 2>&1

Import Cited Patents

The cited patents can be imported into the database by running PatentImporter.java in the package cn.edu.bjut.ui. Note that the parameter flag should be set to false. Then, one can run Preprocessor.java in the package cn.edu.bjut.ui to pre-process the resulting abstracts.

> nohup ./import-derwent.sh > derwent-import.log 2>&1
> nohup ./preprocess-derwent.sh > preprocessor-derwent-cited.log 2>&1

Update Publication Year

To speed things up, the publication year of each patent can be updated by running PublicationYearUpdater.java in the package cn.edu.bjut.ui.

> nohup ./update-patent-publication-year.sh > update-patent-publication-year.log 2>&1

Update the Relations between Patent and Country

The relations between patent and country can be updated by running PatentCountryUpdater.java in the package cn.edu.bjut.ui.

> nohup ./update-patent-country.sh > update-patent-country.log 2>&1

Determine PCT patents

The PCT patents can be determined by running PctFlagUpdater.java in the package cn.edu.bjut.ui.

> nohup ./update-pct-flag.sh > update-pct-flag.log 2>&1

Update Technologies

One can run PatentTechnologyUpdater.java in the package cn.edu.bjut.ui to update the technologies.

> nohup ./update-patent-technology.sh 2 ../dataset/Derwent/technologies > update-patent-technology.log 2>&1

Converter

Project: ElectricPowerConvertor.

Extract Abbreviations

The keywords with brackets should be extracted with the following SQL statement, and the resulting abbreviations are saved into the file abbreviations.xlsx.

> SELECT id, name FROM keyword WHERE id > 784519 AND (name LIKE "%(%" OR name LIKE "%)%") INTO OUTFILE "/var/lib/mysql-files/keyword2020721.csv";

The target articles and patents can be exported by running ToTextConvertor.java in the package cn.edu.bjut.converter. Then, one can extract the abbreviations from the titles and abstracts with the approach in Schwartz and Hearst (2003) by running AbbreviationExtractor.java in the package cn.edu.bjut.ui.

> nohup ./convert-wos-text.sh > convert-wos-text.log 2>&1
> nohup ./convert-derwent-text.sh > convert-derwent-text.log 2>&1
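The core of the Schwartz and Hearst (2003) approach is a right-to-left character match between a short form and the text preceding it. The sketch below follows that matching idea only; candidate extraction from parentheses and the validity filters are omitted, and it is not the code of AbbreviationExtractor.java. The class name AbbreviationMatcher is hypothetical.

```java
/** Sketch of the long-form matching step; candidate extraction and filtering are omitted. */
final class AbbreviationMatcher {

    /**
     * Scans both strings from right to left and returns the part of longForm,
     * starting at a word boundary, whose characters cover shortForm in order.
     * Returns null if no such match exists.
     */
    static String findBestLongForm(String shortForm, String longForm) {
        int lIndex = longForm.length() - 1;
        for (int sIndex = shortForm.length() - 1; sIndex >= 0; sIndex--) {
            char c = Character.toLowerCase(shortForm.charAt(sIndex));
            if (!Character.isLetterOrDigit(c)) {
                continue; // ignore punctuation inside the short form
            }
            // Move left until the character matches; the first short-form character
            // must additionally sit at the beginning of a word in the long form.
            while ((lIndex >= 0 && Character.toLowerCase(longForm.charAt(lIndex)) != c)
                    || (sIndex == 0 && lIndex > 0
                        && Character.isLetterOrDigit(longForm.charAt(lIndex - 1)))) {
                lIndex--;
            }
            if (lIndex < 0) {
                return null;
            }
            lIndex--;
        }
        int start = longForm.lastIndexOf(' ', lIndex) + 1;
        return longForm.substring(start);
    }

    public static void main(String[] args) {
        System.out.println(findBestLongForm("HMM", "we use a hidden Markov model"));
    }
}
```

In practice, longForm would be the few words that precede a parenthesized short form in a title or abstract.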

Export for TNG model

The target articles and patents can be exported for the TNG (Topical N-Grams) model by running ToTNGConvertor.java in the package cn.edu.bjut.converter.

> nohup ./convert-wos-bioc.sh > convert-wos-bioc.log 2>&1
> nohup ./convert-derwent-bioc.sh > convert-derwent-bioc.log 2>&1
> nohup ./convert-tng.sh > convert-tng.log 2>&1

Calculate Novelty Indicator

The resulting journal of each cited article can be exported by running NoveltyConvertor.java in the package cn.edu.bjut.runner. Likewise, the resulting IPC codes of each cited patent can be exported by running NoveltyConvertor.java in the package cn.edu.bjut.runner. Then, the novelty indicator of each document can be calculated by following Uzzi et al. (2013).

> nohup ./convert-wos-novelty.sh ../dataset/WoS/novelty/20240421/paper-citations.txt > convert-wos-novelty.log 2>&1
> nohup ./convert-derwent-novelty.sh ../dataset/Derwent/novelty/20240421/patent-citations.txt > convert-derwent-novelty.log 2>&1

> TRUNCATE article_novelty;
> TRUNCATE patent_novelty;

> nohup ./import-article-novelty.sh ../dataset/WoS/novelty/20240421/paper_novelty.txt > import-article-novelty.log 2>&1
> nohup ./import-patent-novelty.sh ../dataset/Derwent/novelty/20240421/patent_novelty.txt > import-patent-novelty.log 2>&1
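As a rough illustration of the Uzzi et al. (2013) idea, each pair of cited journals (or IPC codes) receives a z-score that compares its observed co-citation count with counts from randomized citation networks, and a document is then summarized by a low percentile and the median of its pair z-scores. The sketch below is an assumption-laden simplification, not the pipeline's implementation: the class NoveltySketch, the population standard deviation and the nearest-rank percentile are choices made here for illustration.

```java
import java.util.Arrays;

/** Illustrative z-score and percentile helpers for a novelty indicator. */
final class NoveltySketch {

    /** z = (observed - mean(randomized)) / sd(randomized); NaN if the spread is zero. */
    static double zScore(double observed, double[] randomized) {
        double mean = 0.0;
        for (double v : randomized) {
            mean += v;
        }
        mean /= randomized.length;
        double variance = 0.0;
        for (double v : randomized) {
            variance += (v - mean) * (v - mean);
        }
        variance /= randomized.length; // population variance (assumption)
        double sd = Math.sqrt(variance);
        return sd == 0.0 ? Double.NaN : (observed - mean) / sd;
    }

    /** Nearest-rank percentile of the values, with p in (0, 100]. */
    static double percentile(double[] values, double p) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

Under these assumptions, percentile(pairZScores, 10) would serve as the novelty score of a document and percentile(pairZScores, 50) as its conventionality score.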

Extract Citation Network

The citation network for articles can be extracted by running WoSCitationNetworkExtractor.java in the package cn.edu.bjut.converter.

The citation network for patents can be extracted by running DerwentCitationNetworkExtractor.java in the package cn.edu.bjut.converter.

The citation network for articles and patents can be extracted by running BothCitationNetworkExtractor.java in the package cn.edu.bjut.converter.

Term-based Topic Modeling

Project: TopicalNGramsModel.

The term-based topics (Xu et al., 2021) can be discovered by running TopicalNGrams.java in the package cn.edu.bjut.ui. The top n-grams of each topic are then re-ranked by Term Frequency-Inverse Document Frequency (TF-IDF) (Hisamitsu et al., 1999; Meyers et al., 2018), Document Relevance Document Consensus (DRDC) (Navigli and Velardi, 2004; Meyers et al., 2018), and Kullback-Leibler Divergence (KLD) (Meyers et al., 2018).
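To make the re-ranking step concrete, the following sketch orders n-grams by a textbook TF-IDF score, tf × ln(N / df). The exact TF-IDF variant used in the cited works may differ, and the class TfIdfRanker with its method signatures is hypothetical rather than part of TopicalNGramsModel.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Illustrative TF-IDF re-ranking of candidate n-grams. */
final class TfIdfRanker {

    /** Textbook TF-IDF: term frequency times the natural log of (numDocs / docFreq). */
    static double score(int termFreq, int docFreq, int numDocs) {
        if (termFreq <= 0 || docFreq <= 0) {
            return 0.0;
        }
        return termFreq * Math.log((double) numDocs / docFreq);
    }

    /** Returns the n-grams of tf sorted by descending TF-IDF score. */
    static List<String> rerank(Map<String, Integer> tf, Map<String, Integer> df, int numDocs) {
        List<String> terms = new ArrayList<>(tf.keySet());
        terms.sort(Comparator.comparingDouble(
                (String t) -> score(tf.get(t), df.getOrDefault(t, 0), numDocs)).reversed());
        return terms;
    }
}
```

Here tf would hold the frequency of each n-gram within a topic and df the number of documents that contain it.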

Indicator Calculator

Project: TechEmergenceIndicators.

One can run IndicatorCalculator.java in the package cn.edu.bjut.ui to calculate all indicators (Xu et al., 2021).

Attributes Extraction

> .\opennlp TokenNameFinderTrainer.brat -nameTypes ATTRIBUTE,VALUE -lang en -model en-ElectronicPower-WoS-attributes.bin -annotationConfig annotation.conf -bratDataDir ElectronicPower-WoS-Train -ruleBasedTokenizer simple -sentenceDetectorMode en-sent.bin
