
Overlapping Thematic Structures Extraction with Mixed-Membership Stochastic Blockmodel

data source: Topic Extraction Challenge

Citation Information

Shuo Xu, Junwan Liu, Dongsheng Zhai, Xin An, Zheng Wang, and Hongshen Pang, 2018. Overlapping Thematic Structures Extraction with Mixed-Membership Stochastic Blockmodel. Scientometrics, Vol. 117, No. 1, pp. 61-84. Results

Notes

astro: data directory

tools: tool directory

doc: documents

Requirements

> java -jar AstroConverter.jar DirectCitation -input ../astro/direct_citations.txt -output ../astro/direct
> java -jar AstroConverter.jar BibliographicCoupling -threshold 4 -weight true -input ../astro/citation_links.txt -output ../astro/coupling
> java -jar AstroConverter.jar Cocitation -threshold 1 -weight true -input ../astro/citation_links.txt -output ../astro/cocitation

Extract giant component

> python extract_giant.py -weight false ../astro/direct.edgelist
> python extract_giant.py -weight true ../astro/coupling.edgelist
> python extract_giant.py -weight true ../astro/cocitation.edgelist
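
For readers unfamiliar with this step, the following is a minimal sketch of what extracting a giant component involves. It is not the code of extract_giant.py: the function names `giant_component` and `filter_edges` are hypothetical, edges are assumed to be undirected pairs, and edge weights are ignored for brevity.

```python
from collections import defaultdict, deque


def giant_component(edges):
    """Return the node set of the largest connected component."""
    # Build an undirected adjacency map from the edge list.
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seen = set()
    best = set()
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        comp = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in comp:
                    comp.add(nb)
                    queue.append(nb)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


def filter_edges(edges, nodes):
    """Keep only the edges whose two endpoints are both in `nodes`."""
    return [(u, v) for u, v in edges if u in nodes and v in nodes]


# Toy edge list (illustrative only, not taken from the astro data).
edges = [(1, 2), (2, 3), (4, 5)]
giant = giant_component(edges)
kept = filter_edges(edges, giant)
```

In this toy input the component {1, 2, 3} is kept and the edge (4, 5) is dropped.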

Discover overlapping communities

Direct Citation

> svinet -file astro/direct.edgelist.giant -n 101831 -k 101831 -eta-type fromdata -findk
> wc -l n101831-k101831-mmsb-findk/communities.txt
> svinet -file astro/direct.edgelist.giant -n 101831 -k 2396 -eta-type fromdata -link-sampling
> cd n101831-k2396-mmsb-linksampling
> svinet -file ../astro/direct.edgelist.giant -n 101831 -k 2396 -gml
 
> svinet -file astro/direct.edgelist.giant -n 101831 -k 113 -eta-type fromdata -link-sampling
> cd n101831-k113-mmsb-linksampling
> svinet -file ../astro/direct.edgelist.giant -n 101831 -k 113 -gml

Bibliographic Coupling

> svinet -file astro/coupling.edgelist.giant -n 101053 -k 101053 -eta-type fromdata -findk
> wc -l n101053-k101053-mmsb-findk/communities.txt
> svinet -file astro/coupling.edgelist.giant -n 101053 -k 992 -eta-type fromdata -link-sampling
> cd n101053-k992-mmsb-linksampling
> svinet -file ../astro/coupling.edgelist.giant -n 101053 -k 992 -gml
 
> svinet -file astro/coupling.edgelist.giant -n 101053 -k 113 -eta-type fromdata -link-sampling
> cd n101053-k113-mmsb-linksampling
> svinet -file ../astro/coupling.edgelist.giant -n 101053 -k 113 -gml

Co-citation

> svinet -file astro/cocitation.edgelist.giant -n 82895 -k 82895 -eta-type fromdata -findk
> wc -l n82895-k82895-mmsb-findk/communities.txt
> svinet -file astro/cocitation.edgelist.giant -n 82895 -k 634 -eta-type fromdata -link-sampling
> cd n82895-k634-mmsb-linksampling
> svinet -file ../astro/cocitation.edgelist.giant -n 82895 -k 634 -gml
 
> svinet -file astro/cocitation.edgelist.giant -n 82895 -k 113 -eta-type fromdata -link-sampling
> cd n82895-k113-mmsb-linksampling
> svinet -file ../astro/cocitation.edgelist.giant -n 82895 -k 113 -gml

Extract Terms with C-Value Methods

> java -jar AstroConverter.jar Term -input ../astro/astro-ALP-2003-2010.csv -output astro
> java -Xmx8g -XX:-UseGCOverheadLimit -cp jate-2.0-beta.1-jar-with-dependencies.jar uk.ac.shef.dcs.jate.app.AppCValue -corpusDir astro -c true -pf.mttf 3 -o cvalue-terms.json solr-testbed ACLRDTEC
> python extract_terms.py cvalue-terms.json
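
As background for the C-value scores produced above, here is a small sketch of the textbook C-value formula (Frantzi, Ananiadou and Mima): a candidate term that is not nested in any longer candidate scores log2(length) × frequency, while a nested candidate has the mean frequency of its containing candidates subtracted first. This is an illustration under those assumptions only; JATE's actual implementation may differ in details, and the names `is_nested` and `c_values` are hypothetical.

```python
import math


def is_nested(short, long):
    """True if `short` is a contiguous proper sub-sequence of `long`."""
    n, m = len(short), len(long)
    return n < m and any(long[i:i + n] == short for i in range(m - n + 1))


def c_values(freq):
    """Map each candidate term (a tuple of words) to its C-value."""
    scores = {}
    for a, f_a in freq.items():
        # Frequencies of the longer candidates that contain `a`.
        containers = [f_b for b, f_b in freq.items() if is_nested(a, b)]
        if containers:
            adjusted = f_a - sum(containers) / len(containers)
        else:
            adjusted = f_a
        # Note: log2(1) = 0, so single-word candidates score zero here.
        scores[a] = math.log2(len(a)) * adjusted
    return scores


# Toy frequencies (made up for illustration, not from the astro corpus).
freq = {
    ("dark", "matter"): 10,
    ("dark", "matter", "halo"): 4,
    ("cold", "dark", "matter"): 2,
}
scores = c_values(freq)
```

With these toy counts, "dark matter" is nested in two longer candidates whose mean frequency is 3, so its score is log2(2) × (10 − 3) = 7.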

Topic Labeling

> python labeler.py hard cvalue-terms.json.terms ../astro/direct.docs.vocab ../astro/n101831-k113-mmsb-linksampling/communities.txt ../astro/direct.micro.txt
> python labeler.py hard cvalue-terms.json.terms ../astro/coupling.docs.vocab ../astro/n101053-k113-mmsb-linksampling/communities.txt ../astro/coupling.micro.txt
> python labeler.py hard cvalue-terms.json.terms ../astro/cocitation.docs.vocab ../astro/n82895-k113-mmsb-linksampling/communities.txt ../astro/cocitation.micro.txt
 
> python labeler.py soft cvalue-terms.json.terms ../astro/direct.docs.vocab ../astro/n101831-k113-mmsb-linksampling/communities.txt ../astro/n101831-k113-mmsb-linksampling/groups.txt
> python labeler.py soft cvalue-terms.json.terms ../astro/coupling.docs.vocab ../astro/n101053-k113-mmsb-linksampling/communities.txt ../astro/n101053-k113-mmsb-linksampling/groups.txt
> python labeler.py soft cvalue-terms.json.terms ../astro/cocitation.docs.vocab ../astro/n82895-k113-mmsb-linksampling/communities.txt ../astro/n82895-k113-mmsb-linksampling/groups.txt

Utilities

> index = GiniIndex('../astro/direct.edgelist.giant.degree', '../astro/coupling.edgelist.giant.degree', '../astro/cocitation.edgelist.giant.degree')
> python statistics.py ../astro/n101831-k2396-mmsb-linksampling/communities.txt
> python statistics.py ../astro/n101053-k992-mmsb-linksampling/communities.txt
> python statistics.py ../astro/n82895-k634-mmsb-linksampling/communities.txt
 
> python statistics.py ../astro/n101831-k113-mmsb-linksampling/communities.txt
> python statistics.py ../astro/n101053-k113-mmsb-linksampling/communities.txt
> python statistics.py ../astro/n82895-k113-mmsb-linksampling/communities.txt
> load cluster_size
> cluster_size_distribution(direct_2396(:, 2), 100, 100)
> cluster_size_distribution(coupling_992(:, 2), 100, 100)
> cluster_size_distribution(cocitation_634(:, 2), 100, 100)
> cluster_size_distribution(direct_113(:, 2), 50, 200)
> cluster_size_distribution(coupling_113(:, 2), 50, 200)
> cluster_size_distribution(cocitation_113(:, 2), 50, 200)
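
The GiniIndex call at the start of this section takes the three degree files as input. Its source is not shown here, so the following is only a sketch of the standard Gini coefficient applied to a sequence of non-negative values such as node degrees; the function name `gini` and the toy inputs are assumptions.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending values, with ranks 1..n.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Toy degree sequences (illustrative only).
equal = gini([1, 1, 1, 1])       # every node has the same degree
skewed = gini([0, 0, 0, 1])      # one node holds all the degree
```

A value near 0 indicates an even degree distribution, and a value near 1 indicates that a few nodes account for most of the links.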

Q-Q (Quantile-Quantile) Plot: Analyze → Descriptive Statistics → Explore → Plots, with “Normality plots with tests” selected.

> python membership.py ../astro/n101831-k113-mmsb-linksampling/communities.txt
> python membership.py ../astro/n101053-k113-mmsb-linksampling/communities.txt
> python membership.py ../astro/n82895-k113-mmsb-linksampling/communities.txt
 
> DistOfMemberships('../astro/n101831-k113-mmsb-linksampling/communities.dist')
> DistOfMemberships('../astro/n101053-k113-mmsb-linksampling/communities.dist')
> DistOfMemberships('../astro/n82895-k113-mmsb-linksampling/communities.dist')
> java -jar AstroConverter.jar Overlap -threshold 150 -input ../astro/n101831-k113-mmsb-linksampling/communities.txt -output ../astro/direct
> java -jar AstroConverter.jar Overlap -threshold 200 -input ../astro/n101053-k113-mmsb-linksampling/communities.txt -output ../astro/coupling
> java -jar AstroConverter.jar Overlap -threshold 200 -input ../astro/n82895-k113-mmsb-linksampling/communities.txt -output ../astro/cocitation

Screenshot → Save → EPS (Encapsulated PostScript) files (*.eps)

Load the file “subgraph.gml” in Gephi to visualize it. Use each node's “group” attribute and each edge's “color” attribute for “partitioning”, and size each node by its bridgeness. (First, copy bridgeness to a new column of type BigDecimal.)

> java -jar AstroConverter.jar Network -node 556 -top 4 -input ../astro/n101831-k113-mmsb-linksampling/gml/network.gml -output ../astro/n101831-k113-mmsb-linksampling/gml/subnetwork.gml
> java -jar AstroConverter.jar Network -node 2881 -top 3 -input ../astro/n101053-k113-mmsb-linksampling/gml/network.gml -output ../astro/n101053-k113-mmsb-linksampling/gml/subnetwork.gml
> java -jar AstroConverter.jar Network -node 354 -top 3 -input ../astro/n82895-k113-mmsb-linksampling/gml/network.gml -output ../astro/n82895-k113-mmsb-linksampling/gml/subnetwork.gml

Note that the -node option is set to the first element in the first line of the corresponding *.edgelist.giant.degree file.
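
To look up the value to pass as -node, one could read the first field of the first line of the degree file. The sketch below assumes the fields are whitespace-separated, which is not confirmed by this README; the helper name `first_node` and the toy file contents are made up for illustration.

```python
import os
import tempfile


def first_node(degree_path):
    """Return the first whitespace-separated field of the first line."""
    with open(degree_path, encoding="utf-8") as handle:
        first_line = handle.readline()
    return first_line.split()[0]


# Demonstration on a throwaway file with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.edgelist.giant.degree")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("42\t7\n13\t5\n")
    node = first_node(path)
```

On the real data, the same helper would be pointed at a file such as ../astro/direct.edgelist.giant.degree.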

> python convert2csv.py ../astro/direct.docs.vocab ../astro/n101831-k113-mmsb-linksampling/communities.txt ../astro/n101831-k113-mmsb-linksampling/groups.txt
> python convert2csv.py ../astro/coupling.docs.vocab ../astro/n101053-k113-mmsb-linksampling/communities.txt ../astro/n101053-k113-mmsb-linksampling/groups.txt
> python convert2csv.py ../astro/cocitation.docs.vocab ../astro/n82895-k113-mmsb-linksampling/communities.txt ../astro/n82895-k113-mmsb-linksampling/groups.txt
> python coclustering.py coclustering ../astro/direct.csv ../astro/coupling.csv 5
> python coclustering.py coclustering ../astro/direct.csv ../astro/cocitation.csv 5
> python coclustering.py coclustering ../astro/cocitation.csv ../astro/coupling.csv 5
> python coclustering.py coclustering ../astro/hd.csv ../astro/direct.csv 5
> python coclustering.py coclustering ../astro/hd.csv ../astro/coupling.csv 5
> python coclustering.py coclustering ../astro/hd.csv ../astro/cocitation.csv 5
> python measure.py ../astro/direct.csv ../astro/coupling.csv 113
> python measure.py ../astro/direct.csv ../astro/cocitation.csv 113
> python measure.py ../astro/coupling.csv ../astro/cocitation.csv 113
> python measure.py ../astro/direct.csv ../astro/hd.csv 113
> python measure.py ../astro/coupling.csv ../astro/hd.csv 113
> python measure.py ../astro/cocitation.csv ../astro/hd.csv 113