AT Model armed with Authorship Credit Allocation Scheme

Requirements

Citation Information

Shuo Xu, Ling Li, Congcong Wang, Xin An, and Guancan Yang, 2022. An Improved Author-Topic (AT) Model with Authorship Credit Allocation Schemes. Journal of Information Science.
Shuo Xu, Ling Li, Liyuan Hao, Xin An, and Guancan Yang, 2021. An Author Interest Discovery Model armed with Authorship Credit Allocation Scheme. iConference: Diversity, Divergence, Dialogue, pp. 199-207.

Create Database

The database is provided as the SQL file synthetic_biology.sql. It consists of the following tables: author, target_article, target_article_author, and target_article_keyword (the last is referenced by the cleanup statements below).

Fill Missing DOI Information

SELECT id, title, doi, pmid, pmc_id FROM target_article WHERE doi IS NULL;

Export the above records to target_article_dois.xlsx in the directory data, then correct them manually one by one.

Once the corrections are done, run TargetArticleDoiUpdater.java to import the information from data/target_article_dois.xlsx into the MySQL database.

Three pairs of duplicate records remain: id = "WOS:000246296800029" and "WOS:000247372300026"; id = "WOS:000297670800005" and "WOS:000293697700003"; and id = "WOS:000393719000030" and "WOS:000394061000172". Run the following SQL to remove the second record of each pair.

DELETE FROM target_article_author WHERE target_article_id = "WOS:000247372300026"; 
DELETE FROM target_article_keyword WHERE target_article_id = "WOS:000247372300026"; 
DELETE FROM target_article WHERE id = "WOS:000247372300026"; 
 
DELETE FROM target_article_author WHERE target_article_id = "WOS:000293697700003"; 
DELETE FROM target_article_keyword WHERE target_article_id = "WOS:000293697700003"; 
DELETE FROM target_article WHERE id = "WOS:000293697700003"; 
 
DELETE FROM target_article_author WHERE target_article_id = "WOS:000394061000172"; 
DELETE FROM target_article_keyword WHERE target_article_id = "WOS:000394061000172"; 
DELETE FROM target_article WHERE id = "WOS:000394061000172";

Several target articles still have an incorrect DOI or no DOI at all. Run the following SQL to correct or add them.

UPDATE target_article SET doi = "10.2307/24102078" WHERE id = "WOS:000252249700032"; 
UPDATE target_article SET doi = "10.1007/978-3-540-77962-9_9" WHERE id = "WOS:000253797300009"; 
UPDATE target_article SET doi = "10.1007/978-3-540-68894-5_7" WHERE id = "WOS:000265422400007"; 
UPDATE target_article SET doi = "10.4028/WWW.SCIENTIFIC.NET/AST.58.10" WHERE id = "WOS:000266359100002"; 
UPDATE target_article SET doi = "10.1080/00365520310000654A" WHERE id = "WOS:000181977200013"; 
UPDATE target_article SET doi = "10.1016/J.JMB.2004.06.053" WHERE id = "WOS:000223379400019"; 
UPDATE target_article SET doi = "10.1145/2024724.2024750" WHERE id = "WOS:000297360000020"; 
UPDATE target_article SET doi = "10.5897/AJB11.1057" WHERE id = "WOS:000298540000011"; 
UPDATE target_article SET doi = "10.1515/1544-6115.1761" WHERE id = "WOS:000306831100007"; 
UPDATE target_article SET doi = "10.1097/00006231-200306000-00013" WHERE id = "WOS:000183373800013"; 
UPDATE target_article SET doi = "10.1097/00005176-200406001-00720" WHERE id = "WOS:000227354700101";

Update Sequence No. and Corresponding Author

SELECT ta.id AS id, ta.title AS title, ta.doi AS doi, ta.pmid AS pmid, ta.pmc_id AS pmcid, ta_a.author_id AS author_id, a.full_name AS full_name, ta_a.seq_no AS seq_no, ta_a.is_reprint AS is_reprint FROM target_article AS ta, target_article_author AS ta_a, author AS a  WHERE ta.id = ta_a.target_article_id AND ta_a.author_id = a.id AND ta.id <= "WOS:000250809900012" ORDER BY id ASC;
 
SELECT ta.id AS id, ta.title AS title, ta.doi AS doi, ta.pmid AS pmid, ta.pmc_id AS pmcid, ta_a.author_id AS author_id, a.full_name AS full_name, ta_a.seq_no AS seq_no, ta_a.is_reprint AS is_reprint FROM target_article AS ta, target_article_author AS ta_a, author AS a  WHERE ta.id = ta_a.target_article_id AND ta_a.author_id = a.id AND ta.id > "WOS:000250809900012" AND ta.id <= "WOS:000286359100002" ORDER BY id ASC;
 
SELECT ta.id AS id, ta.title AS title, ta.doi AS doi, ta.pmid AS pmid, ta.pmc_id AS pmcid, ta_a.author_id AS author_id, a.full_name AS full_name, ta_a.seq_no AS seq_no, ta_a.is_reprint AS is_reprint FROM target_article AS ta, target_article_author AS ta_a, author AS a  WHERE ta.id = ta_a.target_article_id AND ta_a.author_id = a.id AND ta.id > "WOS:000286359100002" ORDER BY id ASC;

Export the records returned by the three queries above to synthetic_biology1.xlsx, synthetic_biology2.xlsx, and synthetic_biology3.xlsx in the directory data, then check them manually one by one.

Once the corrections are done, run TargetArticleSeqNoAndIsReprintUpdater.java in the package cn.edu.bjut.ui.

For unknown reasons, three coauthors are missing from the publication with id = "WOS:000365103600006". Run the following SQL statements to add them.

INSERT author (id, full_name, last_name, first_name) VALUES (10846, "Linard, Alban", "Linard", "Alban"); 
INSERT author (id, full_name, last_name, first_name) VALUES (10849, "Bóbeda, Edmundo López", "Bóbeda", "Edmundo López"); 
INSERT author (id, full_name, last_name, first_name) VALUES (10851, "Marechal, Alexis", "Marechal", "Alexis"); 
 
INSERT target_article_author (target_article_id, author_id, seq_no_original, seq_no, is_reprint_original, is_reprint) VALUES ("WOS:000365103600006", 10846, 4, 4, 0, 0); 
INSERT target_article_author (target_article_id, author_id, seq_no_original, seq_no, is_reprint_original, is_reprint) VALUES ("WOS:000365103600006", 10849, 5, 5, 0, 0); 
INSERT target_article_author (target_article_id, author_id, seq_no_original, seq_no, is_reprint_original, is_reprint) VALUES ("WOS:000365103600006", 10851, 6, 6, 0, 0);

Detect and Tokenize Sentences, and Recognize Entities

Run Converter2Genia.java in the package cn.edu.bjut.genia. The articles are saved in the directory data/genia, and each file is named after the id of its article.

> ./run_geniass.sh geniass data/genia &
> ./run_geniatagger.sh geniatagger data/genia &

For each document, two files are generated, with the extensions .txt.ss and .txt.ss.tag. Save all .txt.ss and .txt.ss.tag files in the directory data/genia.
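A .txt.ss.tag file can be read back for downstream processing. The sketch below is only an illustration: it assumes the common GENIA tagger layout of one token per line with five tab-separated columns (word, base form, POS tag, chunk tag, NE tag) and a blank line between sentences. That layout should be checked against the actual files. The function name parse_genia_tagged and the sample text are hypothetical and not part of this repository.

```python
def parse_genia_tagged(text):
    """Parse tagger output into a list of sentences, each a list of token dicts.

    Assumption: five tab-separated columns per token line and a blank
    line between sentences.
    """
    fields = ("word", "base", "pos", "chunk", "ne")
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        if len(columns) != len(fields):
            raise ValueError(f"expected 5 columns, got {len(columns)}: {line!r}")
        current.append(dict(zip(fields, columns)))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical two-token sample in the assumed format.
sample = "Synthetic\tSynthetic\tJJ\tB-NP\tO\nbiology\tbiology\tNN\tI-NP\tO\n\n"
for sentence in parse_genia_tagged(sample):
    print([(token["word"], token["ne"]) for token in sentence])
```

Keeping each token as a dictionary makes it easy to filter on the NE column later, for example to keep only tokens whose tag is not O.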

Authorship Credit Allocation Schemes

Arithmetic counting scheme: run ArithmeticCredit.java in the package cn.edu.bjut.credit.
Geometric counting scheme: run GeometricCredit.java in the package cn.edu.bjut.credit.
Harmonic counting scheme: run HarmonicCredit.java in the package cn.edu.bjut.credit.
Network-based counting scheme: run NetworkCredit.java in the package cn.edu.bjut.credit.
Axiomatic counting scheme: run AxiomaticCredit.java in the package cn.edu.bjut.credit.
Golden number counting scheme: run GoldenNumberCredit.java in the package cn.edu.bjut.credit.
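To give a feel for what these schemes compute, the following sketch implements the textbook forms of three of them for a byline of n authors ranked 1..n: arithmetic (credit falls linearly with rank), geometric (each author receives twice the credit of the next), and harmonic (credit proportional to 1/rank). It is a minimal illustration, not the code in cn.edu.bjut.credit: the function names are hypothetical, it ignores any special treatment of corresponding authors, and the Java classes may use different variants. The network-based, axiomatic, and golden number schemes are not sketched here.

```python
from fractions import Fraction


def arithmetic_credit(n):
    """Arithmetic counting: author i gets (n + 1 - i) / (1 + 2 + ... + n)."""
    total = n * (n + 1) // 2
    return [Fraction(n + 1 - i, total) for i in range(1, n + 1)]


def geometric_credit(n):
    """Geometric counting: author i gets 2^(n - i) / (2^n - 1)."""
    total = 2 ** n - 1
    return [Fraction(2 ** (n - i), total) for i in range(1, n + 1)]


def harmonic_credit(n):
    """Harmonic counting: author i gets (1 / i) / (1 + 1/2 + ... + 1/n)."""
    total = sum(Fraction(1, j) for j in range(1, n + 1))
    return [Fraction(1, i) / total for i in range(1, n + 1)]


if __name__ == "__main__":
    for scheme in (arithmetic_credit, geometric_credit, harmonic_credit):
        credits = scheme(3)
        # Every scheme distributes exactly one unit of credit per paper.
        assert sum(credits) == 1
        print(scheme.__name__, [str(c) for c in credits])
```

Exact fractions are used so that the shares of a paper sum to exactly 1, which is convenient when comparing schemes by hand.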

Calculate the Coefficient of Variation

> load credits
 
> std(arithmetic(:)) / mean(arithmetic(:))
> std(geometric(:)) / mean(geometric(:))
> std(harmonic(:)) / mean(harmonic(:))
> std(network(:)) / mean(network(:))
> std(axiomatic(:)) / mean(axiomatic(:))
> std(goldenNumber(:)) / mean(goldenNumber(:))
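The same quantity can be computed outside MATLAB. The sketch below mirrors the commands above under the assumption that the credit values are available as a flat list of numbers. MATLAB's std normalizes by N - 1 by default, so the sample standard deviation (statistics.stdev) is the matching choice. The function name and the example values are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(values) / mean


# Hypothetical credit values, not taken from the data set.
example = [0.5, 0.3, 0.2]
print(coefficient_of_variation(example))
```

A larger coefficient of variation indicates that a scheme spreads credit more unevenly across authors.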

Split Train and Test Sets

Run MultiLabelConverter.java in the package cn.edu.bjut.multilabel. This generates two files, syn_bio.corpus and syn_bio.docs, in the directory data/multi-label.

> python split_data.py data/multi-label/syn_bio.corpus 0.45 data/multi-label/syn_bio.splits

Run TrainTestSetSplitter.java in the package cn.edu.bjut.multilabel. This generates two files, syn_bio.train.docs and syn_bio.test.docs, in the directory data/multi-label.
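For readers without access to split_data.py, the idea of a reproducible split can be sketched as follows. This is a plain seeded random split and only an illustration: the actual script may stratify by label or use another strategy, and treating the 0.45 argument as the test fraction is an assumption. The function name split_indices and its parameters are hypothetical.

```python
import random


def split_indices(num_docs, test_fraction, seed=0):
    """Shuffle document indices and split them into (train, test) lists."""
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(num_docs))
    # A dedicated, seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(indices)
    num_test = round(num_docs * test_fraction)
    test = sorted(indices[:num_test])
    train = sorted(indices[num_test:])
    return train, test


# Assumption: 0.45 is interpreted as the share of documents held out for testing.
train, test = split_indices(num_docs=20, test_fraction=0.45)
print(len(train), len(test))
```

Fixing the seed means the same documents land in the same set on every run, which keeps later comparisons between models fair.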

Run Converter2ATCredit.java in the package cn.edu.bjut.genia. Several files for the AT^credit model are generated in the directory data/at_credit.

Parameter Tuning

Run ATArithmeticCreditTuningParam.java, ATAxiomaticCreditTuningParam.java, ATGeometricCreditTuningParam.java, ATGoldenNumberCreditTuningParam.java, ATHarmonicCreditTuningParam.java, and ATNetworkCreditTuningParam.java in the package cn.edu.bjut.ui. To enable the hyper-authorship strategy, set the second parameter to true in these Java files; otherwise, set it to false.

> load train_perplexity; 
 
> figure
> plotPerplexity(arithmetic_disabled, arithmetic_legends); 
> figure 
> plotPerplexity(arithmetic_enabled, arithmetic_legends); 
 
> figure
> plotPerplexity(geometric_disabled, geometric_legends); 
> figure 
> plotPerplexity(geometric_enabled, geometric_legends); 
 
> figure
> plotPerplexity(harmonic_disabled, harmonic_legends); 
> figure 
> plotPerplexity(harmonic_enabled, harmonic_legends); 
 
> figure
> plotPerplexity(network_disabled, network_legends); 
> figure 
> plotPerplexity(network_enabled, network_legends); 
 
> figure
> plotPerplexity(axiomatic_disabled, axiomatic_legends); 
> figure 
> plotPerplexity(axiomatic_enabled, axiomatic_legends); 
 
> figure
> plotPerplexity(golden_number_disabled, golden_number_legends); 
> figure 
> plotPerplexity(golden_number_enabled, golden_number_legends);
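The curves plotted above compare perplexity across parameter settings, where lower values indicate a better fit. As a reminder of the quantity involved, the sketch below computes perplexity in its standard form, the exponential of the negative mean log-likelihood per token. It assumes the per-token natural-log probabilities are already available. The function name and the example values are hypothetical, and the Java tuning classes may organize the computation differently.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum of per-token natural-log probabilities)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical example: every token has probability 1/4 under the model,
# so the model is as uncertain as a uniform choice among four words.
log_probs = [math.log(0.25)] * 8
print(perplexity(log_probs))
```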

Author Interest Discovery

Run ATArithmeticCreditRunner.java, ATAxiomaticCreditRunner.java, ATGeometricCreditRunner.java, ATGoldenNumberCreditRunner.java, ATHarmonicCreditRunner.java, and ATNetworkCreditRunner.java in the package cn.edu.bjut.ui.
