用户工具

站点工具


zh:notes:drugbank

DrugBank Dataset Construction with Extended Information

Citation Information

  1. Shuo Xu, Zhen Liu, and Xin An, 2023. Linkages among Science, Technology, and Industry. The Joint Workshop of the 4th Extraction and Evaluation of Knowledge Entities from Scientific Documents and the 3rd AI + Informetrics (EEKE-AII 2023), pp. 13-15.
  2. Shuo Xu, Ling Li, and Xin An, 2023. Do Academic Inventors have Diverse Interests? Scientometrics, Vol. 128, No. 2, pp. 1023-1053. Note
  3. Shuo Xu, Ling Li, Xin An, Liyuan Hao, and Guancan Yang, 2021. An Approach for Detecting the Commonality and Specialty between Scientific Publications and Patents. Scientometrics, Vol. 126, No. 9, pp. 7445-7475.

Create Database

The database SQL file: drug_bank.sql.

Import DrugBank

To download drugbank_all_full_database.xml.zip from DrugBank and save it in the directory data. Before it, one account needs to be created and approved.

Then, one can import it into the database by running Importer.java in the package cn.edu.bjut.ui.

Articles

Update DOI Names

> UPDATE article SET doi = "10.1097/JTO.0B013E31802F1CD9" WHERE id = 5106;
> UPDATE article SET doi = "10.1358/DOT.2009.45.9.1418185" WHERE id = 6373; 
> UPDATE article SET doi = NULL WHERE id = 5519; 

Update PMC id and doi

To download PMC-ids-csv.gz, and save it in the directory resource.

To run ArticlePmcIdAndDoiUpdater.java in the package cn.edu.bjut.ui.

Extend with Medline/PubMed Full Text

To download Medline/PubMed Full Text in the XML format.

To extract XML files with ArticleXMLExtractor.java in the package cn.edu.bjut.ui, and save them in the directory data/articles/xml.

To import the related information into the database with ArticleXMLImporter.java in the package cn.edu.bjut.ui from the directory data/articles/xml.

To deal with exceptions with SpecialContributorProcessor.java in the package cn.edu.bjut.ui.

Extend with E-Fetch API

To extract XML files with ArticleURLExtractor.java in the package cn.edu.bjut.ui, and save them in the directory data/articles/url.

To import the related information into the database with ArticleURLImporter.java in the package cn.edu.bjut.ui from the directory data/articles/url.

Update some Information for Articles

To run MissingExcelImporter.java to import some missing information from the file data/articles_missing.xlsx.

To run UpdatingExcelImporter.java to update some special abstracts and titles from the file data/articles_updating.xlsx.

Extend with Web of Science

To download bibliographic data in batch in the format of BibTeX from the Web of Science according to DOI names extracted with the following SQL statement:

> SELECT doi FROM article WHERE doi IS NOT NULL AND title IS NULL; 

To save the downloaded data in the directory data/articles/wos. Then, to import bibliographic data to the database by running ArticleBibTexUpdater.java in the package cn.edu.bjut.ui from the directory data/articles/wos.

Update Publication Years

> UPDATE article SET publication_year = 1988 WHERE id = 6520; 
> UPDATE article SET publication_year = 2004 WHERE id = 3296; 

Update Mesh Heading

To download the MeSH data in the format of XML, and then to import it to the database by running MeshHeadingImporter.java in the package cn.edu.bjut.ui.

To extract XML files with ArticleURLExtractor.java in the package cn.edu.bjut.ui, and save them in the directory data/articles/url4mesh.

To import the related information into the database with MeshHeadingUpdater.java in the package cn.edu.bjut.ui from the directory data/articles/url4mesh.

Authors

One can correct and merge authors by running ArticleAuthorCorrector.java in the package cn.edu.bjut.ui and then ArticleChecker.java in the package cn.edu.bjut.ui.

Finally, the initials of each author are updated by running ArticleAuthorInitialsUpdater.java in the package cn.edu.bjut.ui.

Affiliation

One can extract the affiliation information from XML files by running ArticleAffiliationRawImporter.java in the package cn.edu.bjut.ui.

To recognize manually the resulting institution and country, the related information can be exported by running the following SQL statements:

> SELECT article_id, REPLACE(title, "\r\n", " ") AS title, doi, pubmed_id, pmc_id, entity_id, last_name, first_name, aes.seq_no AS seq_no, REPLACE(xml_fragment, "\r\n", "||||||||") AS xml_fragment, REPLACE(raw_affiliation, "\r\n", "||||||||") AS affiliation, aesa.seq_no AS seq_no FROM article_entity_science AS aes, article AS a, entity_science AS es, article_entity_science_affiliation AS aesa WHERE aes.entity_id = es.id AND aes.id = aesa.article_entity_science_id AND aes.article_id = a.id ORDER BY article_id ASC, aes.seq_no ASC, aesa.seq_no ASC; 
> SELECT article_id, REPLACE(title, "\r\n", " ") AS title, doi, pubmed_id, pmc_id, entity_id, last_name, first_name, aes.seq_no AS seq_no, REPLACE(affiliation, '\n', '||||||||') AS affiliation FROM article AS a, article_entity_science AS aes, entity_science AS es WHERE aes.article_id = a.id AND aes.entity_id = es.id AND a.id NOT IN ( SELECT DISTINCT article_id FROM article_entity_science AS aes, article_entity_science_affiliation AS aesa WHERE aes.id = aesa.article_entity_science_id) ORDER BY article_id ASC, seq_no ASC; 

Then, one can run ArticleAffiliationImporter.java in the package cn.edu.bjut.ui to import these affiliations and countries to the database.

Patents

To extract XML files with PatentURLExtractor.java in the package cn.edu.bjut.ui, and save them in the directory data/patents/url.

To import the related information into the database with PatentURLImporter.java in the package cn.edu.bjut.ui from the directory data/patents/url.

To update country information with PatentOriginalCountryUpdater.java in the package cn.edu.bjut.ui.

Inventors and Applicants

Split First and Last Names

The first and last name of each inventor and applicant is split by running EntityTechnologyFirstLastNameSplitter.java in the package cn.edu.bjut.ui. Then, the initials of each inventor and applicant are updated by running EntityTechnologyInitialsUpdater.java in the package cn.edu.bjut.ui.

Correct and Merge Inventors

One can correct and merge authors by running PatentInventorCorrector.java in the package cn.edu.bjut.ui.

Update Country Code

Several inventors are not attached any country code at all. One can retrieve these inventors by running the following SQL statement.

> SELECT * FROM entity_technology WHERE country_code IS NULL AND last_name IS NOT NULL ; 

As for our case, there are 39 inventors in total. The resulting country codes are supplemented manually one by one. Then, by running EntityTechnologyCountryUpdater.java in the package cn.edu.bjut.ui, the supplemented country codes can be imported to the database.

Academic Inventors

To recognize academic inventors, the candidates can be exported by running the following SQL statement:

> SELECT * FROM entity_science AS es, entity_technology AS et WHERE es.last_name = et.last_name AND SUBSTRING(es.initials, 1, 1) = SUBSTRING(et.initials, 1, 1); 

Then, academic inventors are recognized manually one by one. Finally, one can import the recognized academic inventors to the database by running EntityLinkage.java in the package cn.edu.bjut.ui.

Cited Patents

The files patent.tsv.zip and application.tsv.zip are downloaded from the PatentsView. Then, cited patents are updated by running CitedPatentUpdaterWithPatentsView.java in the package cn.edu.bjut.ui.

Many US patents are still not found in PatentsView. One can download them in batch from Patsnap. Then, related information can be imported into the database by running CitedPatentUSUpdaterWithExcel.java in the package cn.edu.bjut.ui.

Similarly, the non-US patents are downloaded in batch from Patsnap. Then, related information can be imported into the database by running CitedPatentUpdaterWithExcel.java in the package cn.edu.bjut.ui.

Science Non-Patent References

Update with Excel

The non-patent references (NPRs) include many types of documents, such as articles, books, reports, and so on. For purpose of identifying scientific NPRs, the original NPRs can be exported by running the following SQL statement:

> SELECT id, other_citation, title, abst, pubmed_id, pmc_id, doi FROM non_patent ORDER BY id ASC; 

One can check manually whether or not each record belongs to the type of scholarly articles. Then, one can update some information for scientific NPRs by running NonPatentUpdaterWithExcel.java in the package cn.edu.bjut.ui.

Update PMC id and PubMed id

To download PMC-ids-csv.gz, and save it in the directory resource.

To run NonPatentPubmedidAndPmcIdUpdater.java in the package cn.edu.bjut.ui.

Merge Duplicate Records

The original NPRs contains many duplicate records. According to DOI, PMC id, and PubMed id, one can merge them by running NonPatentMerger.java in the package cn.edu.bjut.ui.

Update with Articles

One can update science NPRs with the information in articles by running NonPatentUpdaterWithArticle.java in the package cn.edu.bjut.ui.

Extend with Medline/PubMed Full Text

To download Medline/PubMed Full Text in the XML format.

To extract XML files with NonPatentXMLExtractor.java in the package cn.edu.bjut.ui, and save them in the directory data/non_patents/xml.

To import the related information into the database with NonPatentXMLImporter.java in the package cn.edu.bjut.ui from the directory data/non_patents/xml.

Extend with E-Fetch API

To extract XML files with NonPatentURLExtractor.java in the package cn.edu.bjut.ui, and save them in the directory data/non_patents/url.

To import the related information into the database with NonPatentURLImporter.java in the package cn.edu.bjut.ui from the directory data/non_patents/url.

Extend with Web of Science

To download bibliographic data in batch in the format of BibTeX from the Web of Science according to DOI names extracted with the following SQL statement:

> SELECT doi FROM non_patent WHERE doi IS NOT NULL AND title IS NULL; 

To save the downloaded data in the directory data/non_patents/wos. Then, to import bibliographic data to the database by running NonPatentBibTexImporter.java in the package cn.edu.bjut.ui from the directory data/non_patents/wos.

Update Publication Years

> UPDATE non_patent SET publication_year = 2009 WHERE id = 3984; 

Update with Another Excel

Several documents still contain incorrect information. One can update some information for science NPRs by running NonPatentUpdaterWithAnotherExcel.java in the package cn.edu.bjut.ui.

Cited Articles

Update DOI Names

Several DOI numbers do not follow the specification (Xu et al. 2019). In addition, several cited articles are attached multiple DOI numbers. One can retrieve those cited articles by running the following SQL statement.

> SELECT id, text, doi_original, doi FROM cited_article WHERE flag = 1; 

As for our case, there are 274 records in total. These DOI numbers are corrected manually one by one with the help of the following SQL code.

> SELECT a.id AS id, title, doi, publication_year FROM article_cited_article AS a_ca, article AS a WHERE a.id = a_ca.article_id AND a_ca.cited_article_id = ???;

Then, one can run CitedArticleDoiUpdater.java in the package cn.edu.bjut.ui to import these DOI numbers to the database.

By running CitedArticleDoiLogMerger.java, the records with the same DOI number will be merged according to log file CitedArticleDoiUpdater.log in the directory of data.

One can deal with several exceptions by running the following SQL statements.

-- cited_article_id = 4122
> DELETE FROM article_cited_article WHERE cited_article_id = 4122 AND article_id != 9211; 
> UPDATE cited_article SET doi = "10.1200/JCO.2016.34.15_suppl.4502", flag = 0 WHERE id = 4122; 
-- cited_article_id = 5593
> DELETE FROM article_cited_article WHERE cited_article_id = 5593 AND article_id = 6948; 
> UPDATE cited_article SET text = "Siegel R, Ma J, Zou Z, Jemal A. Cancer statistics, 2014. CA Cancer J Clin 2014;64:9-29", doi = "10.3322/CAAC.21208", SOURCE = "MANUAL", flag = 0 WHERE id = 5593; 
> INSERT cited_article (text, doi, SOURCE) VALUES ("Siegel RL, Miller KD, Jemal A. Cancer statistics, 2015. CA Cancer J Clin 2015;65:5-29", "10.3322/CAAC.21254", "MANUAL"); 
> UPDATE article_cited_article SET cited_article_id = 290639 WHERE cited_article_id = 5593 AND article_id = 4822; 
-- cited_article_id = 6281
> UPDATE cited_article SET doi = "10.1002/14651858.CD005454", flag = 0 WHERE id = 6281; 
-- cited_article_id = 36690
> UPDATE cited_article SET text = "Simonneau G, Gatzoulis MA, Adatia I, et al. Updated clinical classification of pulmonary hypertension. J Am Coll Cardiol 2013;62:D34-41", doi = "10.1016/j.jacc.2013.10.029", SOURCE = "MANUAL", flag = 0 WHERE id = 36690; 
> INSERT cited_article (text, doi, SOURCE) VALUES ("Simonneau G, Robbins IM, Beghetti M, et al. Updated clinical classification of pulmonary hypertension. J Am Coll Cardiol.2009;54(1)(suppl):S43-S54", "10.1016/j.jacc.2009.04.012", "MANUAL"); 
> UPDATE article_cited_article SET cited_article_id = 290640 WHERE cited_article_id = 36690 AND article_id IN (5196, 5200)
-- cited_article_id = 39807
> UPDATE cited_article SET text = "European Heart Rhythm Association, European Association for Cardio-Thoracic Surgery, Camm AJ, et al. Guidelines for the management of atrial fibrillation: the Task Force for the Management of Atrial Fibrillation of the European Society of Cardiology (ESC). Eur Heart J. 2010;31:2369-429.", doi = "10.1093/EURHEARTJ/EHQ278", SOURCE = "MANUAL", flag = 0 WHERE id = 39807; 
> INSERT cited_article (text, doi, SOURCE) VALUES ("Camm AJ, Kirchhof P, Lip GY, Schotten U, Savelieva I, Ernst S et al. Guidelines for the management of atrial fibrillation: the Task Force for the Management of Atrial Fibrillation of the European Society of Cardiology (ESC). Europace 2010;12:1360–420.", "10.1093/EUROPACE/EUQ350", "MANUAL"); 
> UPDATE article_cited_article SET cited_article_id = 290641 WHERE cited_article_id = 39807 AND article_id = 5028; 
-- cited_article_id = 39913
> UPDATE cited_article SET text = "D.C. Goff Jr., D.M. Lloyd-Jones, G. Bennett, et al. 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines.J Am Coll Cardiol, 63 (2014), pp. 2935-2959.", doi = "10.1016/J.JACC.2013.11.005", SOURCE = "MANUAL", flag = 0 WHERE id = 39913; 
> INSERT cited_article (text, doi, SOURCE) VALUES ("Goff DC Jr, Lloyd-Jones DM, Bennett G, Coady S, D’Agostino RB Sr, Gibbons R, Greenland P, Lackland DT, Levy D, O’Donnell CJ, Robinson J, Schwartz JS, Shero ST, Smith SC Jr, Sorlie P, Stone NJ, Wilson PW. 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines [published online ahead of print November 12, 2013]. Circulation. doi:10.1161/01.cir.0000437741.48606.98.", "10.1161/01.CIR.0000437741.48606.98", "MANUAL"); 
> UPDATE article_cited_article SET cited_article_id = 290642 WHERE cited_article_id = 39913 AND article_id = 1889; 
-- cited_article_id = 72821
> UPDATE cited_article SET doi = "10.1002/14651858.CD009154", flag = 0 WHERE id = 72821; 
-- cited_article_id = 111070
> UPDATE cited_article SET text = "Hsu DT, Sanford BJ, Meyers KK, Love TM, Hazlett KE, Wang H et al (2013a). Response of the mu-opioid system to social rejection and acceptance. Mol Psychiatry 18: 1211–1217.", doi = "10.1038/MP.2013.96", SOURCE = "MANUAL", flag = 0 WHERE id = 111070; 
> INSERT cited_article (text, doi, SOURCE) VALUES ("Hsu DT, Sanford BJ, Meyers KK, Love TM, Hazlett KE, Wang H et al (2013b). Social feedback activates the endogenous opioid system. Mol Psychiatry 18: 1147.", "10.1038/MP.2013.139", "MANUAL"); 
> INSERT article_cited_article (article_id, cited_article_id) VALUES (6831, 290643); 
-- cited_article_id = 148457
> UPDATE cited_article SET text = "Osmon DR, Berbari EF, Berendt AR, Lew D, Zimmerli W, Steckelberg JM, Rao N, Hanssen A, Wilson WR, Infectious Diseases Society of A. 2013. Executive summary: diagnosis and management of prosthetic joint infection: clinical practice guidelines by the Infectious Diseases Society of America. Clin Infect Dis 56:1-10.", doi = "10.1093/CID/CIS966", SOURCE = "MANUAL", flag = 0 WHERE id = 148457; 
> INSERT cited_article (text, doi, SOURCE) VALUES ("Osmon DR, Berbari EF, Berendt AR, et al. Diagnosis and management of prosthetic joint infection: clinical practice guidelines by the Infectious Diseases Society of America. Clin Infect Dis 2013; 56:e1–e25.", "10.1093/CID/CIS803", "MANUAL"); 
> UPDATE article_cited_article SET cited_article_id = 290643 WHERE cited_article_id = 148457 AND article_id = 1745; 
-- cited_article_id = 248946
> UPDATE cited_article SET text = "Takusagawa S, Yajima K, Miyashita A, et al. Identification of human cytochrome P450 isoforms and esterases involved in the metabolism of mirabegron, a potent and selective beta3-adrenoceptor agonist. Xenobiotica 2012;42(10):957-67", doi = "10.3109/00498254.2012.675095", SOURCE = "MANUAL", flag = 0 WHERE id = 248946; 
> INSERT cited_article (text, doi, SOURCE) VALUES ("Takusagawa S, Miyashita A, Iwatsubo T, Usui T. In vitro inhibition and induction of human cytochrome P450 enzymes by mirabegron, a potent and selective beta3-adrenoceptor agonist. Xenobiotica 2012;42(12):1187-96", "10.3109/00498254.2012.700140", "MANUAL"); 
> INSERT article_cited_article (article_id, cited_article_id) VALUES (5900, 290645); 
> UPDATE cited_article SET doi = "10.1080/10550490903077929", flag = 0 WHERE id = 81627; 
> UPDATE cited_article SET doi = "10.1016/S0002-9270(02)06028-8", flag = 0 WHERE id = 208624; 
> UPDATE cited_article SET doi = "10.1016/S0002-9270(03)00623-3", flag = 0 WHERE id = 266847; 
> UPDATE cited_article SET doi = "10.1079/BJN2002821", flag = 0 WHERE id = 254572; 
> UPDATE cited_article SET doi = "10.1093/NDT/GFG1056", flag = 0 WHERE id = 179626;

One can run CitedArticleDoiChecker.java in the package cn.edu.bjut.ui to check whether multiple cited articles are attached a same DOI number. In this time, two cited articles are found to share a same DOI number. One can merge these cited articles by running the following SQL statements.

> UPDATE article_cited_article SET cited_article_id = 6281 WHERE cited_article_id = 155267; 
> DELETE FROM cited_article WHERE id = 155267; 

Update PMC id and PubMed id

To download PMC-ids-csv.gz, and save it in the directory resource.

To run CitedArticlePubmedIdAndPmcIdUpdator.java in the package cn.edu.bjut.ui.

Extend with Medline/PubMed Full Text

To download Medline/PubMed Full Text in the XML format.

To extract XML files with CitedArticleXMLExtractor.java in the package cn.edu.bjut.ui, and save them in the directory data/cited_articles/xml.

To import the related information into the database with CitedArticleXMLImporter.java in the package cn.edu.bjut.ui from the directory data/cited_articles/xml.

Extend with E-Fetch API

To extract XML files with CitedArticlePubMedIdURLExtractor.java in the package cn.edu.bjut.ui, and save them in the directory data/cited_articles/url/pubmed.

To import the related information into the database with CitedArticlePubMedIdURLImporter.java in the package cn.edu.bjut.ui from the directory data/cited_articles/url/pubmed.

To extract XML files with CitedArticlePmcIdURLExtractor.java in the package cn.edu.bjut.ui, and save them in the directory data/cited_articles/url/pmc.

To import the related information into the database with CitedArticlePmcIdURLImporter.java in the package cn.edu.bjut.ui from the directory data/cited_articles/url/pmc.

Extend with Web of Science

To download bibliographic data in batch in the format of BibTeX from the Web of Science according to DOI names extracted with the following SQL statement:

> SELECT doi FROM cited_article WHERE title IS NULL AND doi IS NOT NULL AND flag = 0; 

To save the downloaded data in the directory data/cited_articles/wos. Then, to import bibliographic data to the database by running CitedArticleUpdaterWithBibTex.java in the package cn.edu.bjut.ui from the directory data/cited_articles/wos.

Extend with Excel

Several exceptions cannot be downloaded from the Web of Science. Hence, we arrange them in an excel file. Thus, one can import them into the database by running CitedArticleUpdaterWithExcel.java in the package cn.edu.bjut.ui.

Patent Cases

There are many references with the type of patents in our cited articles. One can retrieve them with the following SQL statement, and then check them one by one.

> SELECT * FROM cited_article WHERE text LIKE "%patent%"; 

There is one duplication with id = 193605 and 193646. To run the following SQL statements, the duplication will be removed.

> UPDATE article_cited_article SET cited_article_id = 193605 WHERE cited_article_id = 193646; 
> DELETE FROM cited_article WHERE id = 193646; 

These patents can be downloaded in batch from Patsnap. Then, related information can be imported into the database by running CitedArticlePatentsUpdaterWithExcel.java in the package cn.edu.bjut.ui.

Update Publication Years and other Fields

> UPDATE cited_article SET text = "Asselah T, De Muynck S, Broet P, et al. IL28B polymorphism is associated with treatment response in patients with genotype 4 chronic hepatitis C. J Hepatol 2011;56(3):527-32", title = "IL28B polymorphism is associated with treatment response in patients with genotype 4 chronic hepatitis C", abst = "Background & Aims: Polymorphisms in the region of the interleukin (IL)28B gene have been associated with pegylated-interferon (PEG-IFN) and ribavirin treatment response mainly in genotype 1 HCV infections. However, there are few data on HCV genotype 4 (HCV-4) infection. We evaluated, in a unique well-characterized cohort of HCV-4 patients, the association of IL28B polymorphism with response to treatment or liver disease severity. Methods: This study included 164 HCV-4 patients from different ethnic groups (Egyptian, European, and Sub-Saharan African). Among these patients, 82 were studied for response and 160 for disease severity. Free DNA extracted from all the 164 patient’s serum samples was analyzed by direct sequencing of the SNP rs12979860 of IL28B. Genetic and bio-clinical features from patients having sustained virological response (43 SVR patients) and from those who did not respond to treatment or had a relapse after the end of the treatment (39 NR patients) were compared. IL28B polymorphism was compared between the 78 patients with mild fibrosis (Metavir score F0-F1) and the 82 with advanced fibrosis (F2-F4). Results: Our data showed a better treatment response rate of the C allele of the IL28B gene SNP rs12979860 (p = 0.0008). The response rates were 81.8%, 46.5%, and 29.4% for genotype CC, CT, and TT, respectively. No significant relationship was found between rs12979860 and the severity of the disease. Conclusions: The SNP rs12979860 is strongly associated with SVR in patients infected with HCV-4, but not with liver disease severity. Analysis of IL28B genotype might be used to guide treatment for these patients.", publication_year = 2012 WHERE id = 213854; 
zh/notes/drugbank.txt · 最后更改: 2023/12/09 12:18 由 pzczxs