AT Model armed with Authorship Credit

Requirements

Citation Information

To be added later.

Create Database

The database is defined in the SQL file synthetic_biology.sql. It consists of the following tables: author, cited_author, cited_article_author, citing_article, citing_article_cited_article, citing_article_keyword, keyword, target_article, target_article_author, and target_article_keyword.

Fill Missing DOI Information

SELECT id, title, doi, pmid, pmc_id FROM target_article WHERE doi IS NULL;

Export the above records to target_article_dois.xlsx in the directory data, then correct them manually one by one.

Once the corrections are done, run TargetArticleDoiUpdater.java to import the corrected information from data/target_article_dois.xlsx into the MySQL database.

Three pairs of duplicate records still remain: id = “WOS:000246296800029” and “WOS:000247372300026”, id = “WOS:000297670800005” and “WOS:000293697700003”, and id = “WOS:000393719000030” and “WOS:000394061000172”. Run the following SQL statements to remove the second record of each pair.

DELETE FROM target_article_author WHERE target_article_id = "WOS:000247372300026"; 
DELETE FROM target_article_keyword WHERE target_article_id = "WOS:000247372300026"; 
DELETE FROM target_article WHERE id = "WOS:000247372300026"; 
 
DELETE FROM target_article_author WHERE target_article_id = "WOS:000293697700003"; 
DELETE FROM target_article_keyword WHERE target_article_id = "WOS:000293697700003"; 
DELETE FROM target_article WHERE id = "WOS:000293697700003"; 
 
DELETE FROM target_article_author WHERE target_article_id = "WOS:000394061000172"; 
DELETE FROM target_article_keyword WHERE target_article_id = "WOS:000394061000172"; 
DELETE FROM target_article WHERE id = "WOS:000394061000172";
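
If the schema enforces foreign keys, the order of the statements above matters: the rows in target_article_author and target_article_keyword that reference a duplicate have to go before the target_article row itself. The following is a minimal Python sketch of that pattern, using an in-memory SQLite database as a stand-in for MySQL. The helper names (delete_target_article, make_demo_db, count_rows) and the toy schema are illustrative assumptions and are not part of this project.

```python
import sqlite3

# Child tables first, parent table last (mirrors the SQL statements above).
CHILD_TABLES = ("target_article_author", "target_article_keyword")


def delete_target_article(conn, article_id):
    """Delete one target article and its dependent rows in a single transaction."""
    with conn:  # commits on success, rolls back on error
        for table in CHILD_TABLES:
            conn.execute(
                f"DELETE FROM {table} WHERE target_article_id = ?", (article_id,)
            )
        conn.execute("DELETE FROM target_article WHERE id = ?", (article_id,))


def make_demo_db():
    """Build a toy schema (hypothetical, far smaller than synthetic_biology.sql)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE target_article (id TEXT PRIMARY KEY);
        CREATE TABLE target_article_author (target_article_id TEXT, author_id INTEGER);
        CREATE TABLE target_article_keyword (target_article_id TEXT, keyword_id INTEGER);
        INSERT INTO target_article VALUES ('WOS:000246296800029'), ('WOS:000247372300026');
        INSERT INTO target_article_author VALUES ('WOS:000247372300026', 1);
        INSERT INTO target_article_keyword VALUES ('WOS:000247372300026', 7);
        """
    )
    return conn


def count_rows(conn, table):
    """Return the number of rows in a table."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

Wrapping the three deletes in one transaction means a failure part-way through leaves the article and its dependent rows untouched.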

Several target articles still have a wrong DOI name or no DOI name at all. Run the following SQL statements to correct or add these DOI names.

UPDATE target_article SET doi = "10.2307/24102078" WHERE id = "WOS:000252249700032"; 
UPDATE target_article SET doi = "10.1007/978-3-540-77962-9_9" WHERE id = "WOS:000253797300009"; 
UPDATE target_article SET doi = "10.1007/978-3-540-68894-5_7" WHERE id = "WOS:000265422400007"; 
UPDATE target_article SET doi = "10.4028/WWW.SCIENTIFIC.NET/AST.58.10" WHERE id = "WOS:000266359100002"; 
UPDATE target_article SET doi = "10.1080/00365520310000654A" WHERE id = "WOS:000181977200013"; 
UPDATE target_article SET doi = "10.1016/J.JMB.2004.06.053" WHERE id = "WOS:000223379400019"; 
UPDATE target_article SET doi = "10.1145/2024724.2024750" WHERE id = "WOS:000297360000020"; 
UPDATE target_article SET doi = "10.5897/AJB11.1057" WHERE id = "WOS:000298540000011"; 
UPDATE target_article SET doi = "10.1515/1544-6115.1761" WHERE id = "WOS:000306831100007"; 
UPDATE target_article SET doi = "10.1097/00006231-200306000-00013" WHERE id = "WOS:000183373800013"; 
UPDATE target_article SET doi = "10.1097/00005176-200406001-00720" WHERE id = "WOS:000227354700101"; 
UPDATE target_article SET doi = "10.14670/HH-26.471" WHERE id = "WOS:000287804300007";
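
Before applying hand-entered DOI names like those above, a quick syntax check can catch typing slips. The sketch below is an assumption-laden illustration, not part of this project: the function name is_plausible_doi is hypothetical, and the pattern only covers the common modern shape of a DOI name ("10.", a numeric registrant code, a slash, and a suffix). Passing the check does not mean the DOI resolves.

```python
import re

# Common shape of a DOI name: "10.", a 4-9 digit registrant code, "/", a suffix.
# This is a heuristic; unusual legacy DOIs may not match it.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def is_plausible_doi(doi):
    """Return True if the string has the general shape of a DOI name."""
    return bool(DOI_PATTERN.match(doi.strip()))
```

For example, is_plausible_doi("10.1145/2024724.2024750") passes, while a value with a leading "doi:" prefix or a missing suffix does not.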

Update Sequence No. and Corresponding Author

SELECT ta.id AS id, ta.title AS title, ta.doi AS doi, ta.pmid AS pmid, ta.pmc_id AS pmcid, ta_a.author_id AS author_id, a.full_name AS full_name, ta_a.seq_no AS seq_no, ta_a.is_reprint AS is_reprint FROM target_article AS ta, target_article_author AS ta_a, author AS a  WHERE ta.id = ta_a.target_article_id AND ta_a.author_id = a.id AND ta.id <= "WOS:000250809900012" ORDER BY id ASC;
 
SELECT ta.id AS id, ta.title AS title, ta.doi AS doi, ta.pmid AS pmid, ta.pmc_id AS pmcid, ta_a.author_id AS author_id, a.full_name AS full_name, ta_a.seq_no AS seq_no, ta_a.is_reprint AS is_reprint FROM target_article AS ta, target_article_author AS ta_a, author AS a  WHERE ta.id = ta_a.target_article_id AND ta_a.author_id = a.id AND ta.id > "WOS:000250809900012" AND ta.id <= "WOS:000286359100002" ORDER BY id ASC;
 
SELECT ta.id AS id, ta.title AS title, ta.doi AS doi, ta.pmid AS pmid, ta.pmc_id AS pmcid, ta_a.author_id AS author_id, a.full_name AS full_name, ta_a.seq_no AS seq_no, ta_a.is_reprint AS is_reprint FROM target_article AS ta, target_article_author AS ta_a, author AS a  WHERE ta.id = ta_a.target_article_id AND ta_a.author_id = a.id AND ta.id > "WOS:000286359100002" ORDER BY id ASC;

Export the above records to synthetic_biology1.xlsx, synthetic_biology2.xlsx and synthetic_biology3.xlsx in the directory data, then check them manually one by one.

Once the corrections are done, run TargetArticleSeqNoAndIsReprintUpdater.java in the package cn.edu.bjut.ui.

For unknown reasons, three coauthors are missing from the publication with id = “WOS:000365103600006”. Run the following SQL statements to add them.

INSERT author (id, full_name, last_name, first_name) VALUES (10846, "Linard, Alban", "Linard", "Alban"); 
INSERT author (id, full_name, last_name, first_name) VALUES (10849, "Bóbeda, Edmundo López", "Bóbeda", "Edmundo López"); 
INSERT author (id, full_name, last_name, first_name) VALUES (10851, "Marechal, Alexis", "Marechal", "Alexis"); 
 
INSERT target_article_author (target_article_id, author_id, seq_no_original, seq_no, is_reprint_original, is_reprint) VALUES ("WOS:000365103600006", 10846, 4, 4, 0, 0); 
INSERT target_article_author (target_article_id, author_id, seq_no_original, seq_no, is_reprint_original, is_reprint) VALUES ("WOS:000365103600006", 10849, 5, 5, 0, 0); 
INSERT target_article_author (target_article_id, author_id, seq_no_original, seq_no, is_reprint_original, is_reprint) VALUES ("WOS:000365103600006", 10851, 6, 6, 0, 0);

Fetch and Import Citing Articles

Run DownloadByWosId.java to generate the file citing_article.wos_id in the directory data. Using this file, fetch the full records and cited references in BibTeX format from Web of Science, and save them in the directory data/wos/citing.

Then run CitingArticleBibTexImporter.java.

Update the doi Field of the Records with Multiple DOIs

Run CitedArticleDoiResolver.java, and save the log file CitedArticleDoiResolver.log in the directory data.

Run CitedArticleDoiLogMerger. The records with the same DOI name will be merged according to the log file CitedArticleDoiResolver.log in the directory data.
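
To illustrate what "records with the same DOI name" means in practice, here is a minimal Python sketch that groups record ids by DOI name. DOI names are case-insensitive, so the comparison is done on a lower-cased, trimmed copy. The function name group_by_doi and the (id, doi) input shape are assumptions for illustration; CitedArticleDoiLogMerger works from the log file, and its actual logic is not shown here.

```python
from collections import defaultdict


def group_by_doi(records):
    """Group record ids by DOI name, ignoring case and surrounding whitespace.

    `records` is an iterable of (record_id, doi) pairs; rows without a DOI are skipped.
    """
    groups = defaultdict(list)
    for record_id, doi in records:
        if doi:
            groups[doi.strip().lower()].append(record_id)
    # Keep only DOIs shared by two or more records: these are merge candidates.
    return {doi: ids for doi, ids in groups.items() if len(ids) > 1}
```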

Correct the Incorrect DOIs

SELECT id, text, doi, parsed_flag FROM cited_article WHERE flag = 1;

Export the above records to cited_article_dois.xlsx in the directory data, then correct them manually one by one with the help of the following SQL query.

SELECT c.id AS id, title, doi FROM citing_article_cited_article AS cc, citing_article AS c WHERE c.id = cc.citing_article_id AND cc.cited_article_id = ???;

Once the corrections are done, run CitedArticleDoiSpliter.java to import the corrected information from data/cited_article_dois.xlsx into the MySQL database, and save the log file CitedArticleDoiSpliter.log in the directory data.

Run CitedArticleDoiLogMerger. The records with the same DOI name will be merged according to the log file CitedArticleDoiSpliter.log in the directory data.

Run CitedArticleDoiUpdater.java to import the corrected information from data/cited_article_dois.xlsx into the MySQL database.

Note that several records have multiple DOI names that resolve to the same publication. To avoid duplication, run CitedArticleMultiDoiMerger.java.

Update PubMed ID and PMC ID

Download PMC-ids-csv.gz, and save it in the directory resource.

Run PubMedIdUpdator.java.
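
The lookup that such an update relies on can be sketched as follows: read the gzip-compressed CSV and build a map from DOI name to PubMed ID and PMC ID. This is only a sketch under stated assumptions. The column names DOI, PMID and PMCID are assumed and should be checked against the header of the downloaded file, the function name load_id_map is hypothetical, and PubMedIdUpdator.java may work differently.

```python
import csv
import gzip


def load_id_map(gz_source):
    """Map lower-cased DOI -> (PMID, PMCID) from a gzip-compressed CSV.

    `gz_source` is a file path or a binary file object.
    Assumes header columns named DOI, PMID and PMCID; adjust if the file differs.
    """
    id_map = {}
    with gzip.open(gz_source, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            doi = (row.get("DOI") or "").strip().lower()
            if doi:  # rows without a DOI cannot be matched by DOI
                id_map[doi] = (row.get("PMID") or None, row.get("PMCID") or None)
    return id_map
```

Reading the file as a stream avoids decompressing it to disk first.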

Data Augmentation with Medline/PubMed Full Text

Download the Medline/PubMed full text in XML format.

Extract the XML files with CitedArticleXMLExtractor.java, and save them in the directory data/xml.

Import the related information from the directory data/xml into the database with CitedArticleXMLImporter.java.

Data Augmentation with E-Fetch API

Extract the XML files with CitedArticleURLExtractor.java, and save them in the directories data/url/pmc and data/url/pubmed.

Import the related information from the directories data/url/pmc and data/url/pubmed into the database with CitedArticleURLImporter.java.
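
For orientation, an E-Fetch request is an HTTP GET against the NCBI E-utilities endpoint with the database name and record ids as query parameters. The sketch below only builds such a URL and performs no network access. The function name efetch_url is hypothetical, and whether CitedArticleURLExtractor.java uses exactly these parameters is an assumption.

```python
from urllib.parse import urlencode

EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(db, ids):
    """Build an E-Fetch request URL that asks for XML for the given record ids.

    `db` is an Entrez database name such as "pubmed" or "pmc".
    """
    params = {"db": db, "id": ",".join(str(i) for i in ids), "retmode": "xml"}
    # Keep commas literal so the id list stays readable in the URL.
    return f"{EFETCH_BASE}?{urlencode(params, safe=',')}"
```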

Fetch and Import Cited Articles

Run DownloadByDoi.java to generate several files named ref-NUMBER.doi in the directory data/download_by_dois. Using these files, fetch the full records and cited references in BibTeX format from the Web of Science Core Collection, and save them in the directory data/wos/cited.

Import the related information from the directory data/wos/cited into the database with CitedArticleBibTexImporter.java.

Separate Last Name from First Name

SELECT id, full_name, first_name, last_name, emails FROM author WHERE last_name IS NULL AND first_name IS NULL;

Export the above records to authors.xlsx in the directory data, then manually separate the last name from the first name one by one.

Once the corrections are done, run AuthorFirstLastNameSplitter.java in the package cn.edu.bjut.ui.
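
Names already stored as "Last, First" (for example "Linard, Alban") can be separated mechanically, which leaves only the irregular ones for manual work. The following is a minimal sketch of that idea; the function name split_full_name is hypothetical, and it is not the logic of AuthorFirstLastNameSplitter.java, which imports the manually corrected spreadsheet.

```python
def split_full_name(full_name):
    """Split "Last, First" into (last_name, first_name) at the first comma.

    Returns None when there is no comma, so the record can be handled by hand.
    """
    last, sep, first = full_name.partition(",")
    if not sep:
        return None
    return last.strip(), first.strip()
```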

Then delete the following three records from the author table, together with the related records in cited_article_author.

-- The three ids correspond to "et al.", "<colla/>", and "[Anonymous]".
DELETE FROM cited_article_author WHERE author_id IN (691856,752299,1328356);
DELETE FROM author WHERE id IN (691856,752299,1328356);

Authorship Credit Allocation Schemes

Arithmetic counting scheme: run ArithmeticCredit.java in the package cn.edu.bjut.credit.
Geometric counting scheme: run GeometricCredit.java in the package cn.edu.bjut.credit.
Harmonic counting scheme: run HarmonicCredit.java in the package cn.edu.bjut.credit.
Network-based counting scheme: run NetworkCredit.java in the package cn.edu.bjut.credit.
Axiomatic counting scheme: run AxiomaticCredit.java in the package cn.edu.bjut.credit.
Golden number counting scheme: run GoldenNumberCredit.java in the package cn.edu.bjut.credit.
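
As a rough guide to what the first three schemes compute, the sketch below gives their usual textbook forms in Python: for a byline of n authors, the i-th author's share is proportional to n + 1 - i (arithmetic), 2^(n - i) (geometric), or 1 / i (harmonic), normalized so that the shares sum to 1. This is an illustration under assumptions, not the code of the Java classes, which may differ, for instance in how corresponding authors (is_reprint) are treated. The network-based, axiomatic and golden number schemes are more involved and are not sketched here.

```python
from fractions import Fraction


def arithmetic_credit(n):
    """Share of the i-th of n authors is proportional to n + 1 - i."""
    total = n * (n + 1) // 2
    return [Fraction(n + 1 - i, total) for i in range(1, n + 1)]


def geometric_credit(n):
    """Share of the i-th of n authors is proportional to 2^(n - i)."""
    total = 2 ** n - 1
    return [Fraction(2 ** (n - i), total) for i in range(1, n + 1)]


def harmonic_credit(n):
    """Share of the i-th of n authors is proportional to 1 / i."""
    total = sum(Fraction(1, j) for j in range(1, n + 1))
    return [Fraction(1, i) / total for i in range(1, n + 1)]
```

Exact fractions are used so that the shares of one article sum to exactly 1; convert with float() where decimals are needed.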

Calculate the Coefficient of Variation

> load credits
> std(arithmetic(:)) / mean(arithmetic(:))
> std(geometric(:)) / mean(geometric(:))
> std(harmonic(:)) / mean(harmonic(:))
> std(network(:)) / mean(network(:))
> std(axiomatic(:)) / mean(axiomatic(:))
> std(goldenNumber(:)) / mean(goldenNumber(:))
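
The same quantity can be computed outside MATLAB. The commands above flatten each credit matrix with (:) and divide the standard deviation by the mean; MATLAB's std normalizes by N - 1 by default, which corresponds to the sample standard deviation. A minimal Python equivalent, with the hypothetical function name coefficient_of_variation and a flat list of values as input:

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation (N - 1 normalization) divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)
```

At least two values are required, and the mean must be non-zero.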

Detect and Tokenize Sentences, and Recognize Entities

Run Converter2Genia.java in the package cn.edu.bjut.genia to save the articles in the directory data/genia. Each article file is named after its id.

> ./run_geniass.sh geniass data/genia &
> ./run_geniatagger.sh geniatagger data/genia &

For each document, two files are generated, with the extensions .txt.ss and .txt.ss.tag. Save all .txt.ss and .txt.ss.tag files in the directory data/genia.
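
The tagged files can be read back with a few lines of Python. The sketch below assumes the usual GENIA tagger layout: one token per line with five tab-separated fields (word, base form, POS tag, chunk tag, named-entity tag) and a blank line between sentences. Check this against an actual .txt.ss.tag file, since the wrapper script run_geniatagger.sh may alter the output. The function names are hypothetical.

```python
def read_tagged_sentences(lines):
    """Parse tab-separated tagger output into a list of sentences.

    Each token line is assumed to hold: word, base form, POS tag, chunk tag, NE tag.
    A blank line is assumed to end a sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        # Raises ValueError if a token line has fewer than five fields.
        word, base, pos, chunk, ne = line.split("\t")[:5]
        current.append({"word": word, "base": base, "pos": pos, "chunk": chunk, "ne": ne})
    if current:
        sentences.append(current)
    return sentences


def entity_tokens(sentence):
    """Return the words whose NE tag is not the outside tag "O"."""
    return [tok["word"] for tok in sentence if tok["ne"] != "O"]
```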

Split Train and Test Sets

Run MultiLabelConverter.java in the package cn.edu.bjut.multilabel. This generates two files, syn_bio.corpus and syn_bio.docs, in the directory data/multi-label.

> python split_data.py data/multi-label/syn_bio.corpus 0.45 data/multi-label/syn_bio.splits

Run TrainTestSetSplitter.java in the package cn.edu.bjut.multilabel. This generates two files, syn_bio.train.docs and syn_bio.test.docs, in the directory data/multi-label.
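
The general idea of a reproducible train/test split is sketched below. It is not the algorithm of split_data.py, which is not described here and may, for example, balance labels across the two sets. The sketch assumes a plain seeded shuffle, treats a fraction such as 0.45 as the share of documents held out for testing, and uses the hypothetical function name split_indices.

```python
import random


def split_indices(n_docs, test_fraction, seed=0):
    """Shuffle document indices and cut them into (train, test) lists.

    `test_fraction` is assumed to be the share of documents held out for testing.
    """
    indices = list(range(n_docs))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = round(n_docs * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])
```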

Parameter Tuning

Run ATArithmeticCreditTuningParam.java, ATAxiomaticCreditTuningParam.java, ATGeometricCreditTuningParam.java, ATGoldenNumberCreditTuningParam.java, ATHarmonicCreditTuningParam.java, and ATNetworkCreditTuningParam.java in the package cn.edu.bjut.ui. To turn on the hyper-authorship strategy, set the second parameter to true in these Java files; otherwise set it to false.

> load train_perplexity; 
 
> figure
> plotPerplexity(arithmetic_disabled, arithmetic_legends); 
> figure 
> plotPerplexity(arithmetic_enabled, arithmetic_legends); 
 
> figure
> plotPerplexity(geometric_disabled, geometric_legends); 
> figure 
> plotPerplexity(geometric_enabled, geometric_legends); 
 
> figure
> plotPerplexity(harmonic_disabled, harmonic_legends); 
> figure 
> plotPerplexity(harmonic_enabled, harmonic_legends); 
 
> figure
> plotPerplexity(network_disabled, network_legends); 
> figure 
> plotPerplexity(network_enabled, network_legends); 
 
> figure
> plotPerplexity(axiomatic_disabled, axiomatic_legends); 
> figure 
> plotPerplexity(axiomatic_enabled, axiomatic_legends); 
 
> figure
> plotPerplexity(golden_number_disabled, golden_number_legends); 
> figure 
> plotPerplexity(golden_number_enabled, golden_number_legends);
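
For readers unfamiliar with the quantity being plotted, perplexity is conventionally the exponential of the negative mean log-likelihood per token, and lower values indicate a better fit. The sketch below shows that standard definition in Python; how the tuning classes above compute the values stored in train_perplexity is not stated here, so treat the correspondence as an assumption. The function name perplexity is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean natural-log likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

A model that assigns probability 1/4 to every token has a perplexity of 4, the same as a uniform choice among four words.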

Author Interest Discovery

Run ATArithmeticCreditRunner.java, ATAxiomaticCreditRunner.java, ATGeometricCreditRunner.java, ATGoldenNumberCreditRunner.java, ATHarmonicCreditRunner.java, and ATNetworkCreditRunner.java in the package cn.edu.bjut.ui.
