Overlapping Thematic Structures Extraction with Mixed-Membership Stochastic Blockmodel

Citation Information

Shuo Xu, Junwan Liu, Dongsheng Zhai, Xin An, Zheng Wang, and Hongshen Pang, 2018. Overlapping Thematic Structures Extraction with Mixed-Membership Stochastic Blockmodel. Scientometrics, Vol. 117, No. 1, pp. 61-84.

Notes

astro: data directory

tools: tool directory

doc: documents

Requirements

> java -jar AstroConverter.jar DirectCitation -input ../astro/direct_citations.txt -output ../astro/direct
> java -jar AstroConverter.jar BibliographicCoupling -threshold 4 -weight true -input ../astro/citation_links.txt -output ../astro/coupling
> java -jar AstroConverter.jar Cocitation -threshold 1 -weight true -input ../astro/citation_links.txt -output ../astro/cocitation
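
For readers unfamiliar with the two derived networks, the following is a minimal sketch of how bibliographic coupling and co-citation weights can be computed from (citing, cited) pairs. It is not the AstroConverter implementation: the function names, the in-memory input format, and the reading of the threshold as "keep edges with weight >= threshold" are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def coupling_weights(links):
    """Bibliographic coupling: number of shared references per pair of citing papers."""
    citers_of = defaultdict(set)
    for citing, cited in links:
        citers_of[cited].add(citing)
    weights = defaultdict(int)
    # Every pair of papers citing the same reference gains one unit of weight.
    for citers in citers_of.values():
        for a, b in combinations(sorted(citers), 2):
            weights[(a, b)] += 1
    return dict(weights)


def cocitation_weights(links):
    """Co-citation: number of papers that cite both members of a pair."""
    # Co-citation is bibliographic coupling on the reversed citation graph.
    return coupling_weights([(cited, citing) for citing, cited in links])


def apply_threshold(weights, min_weight):
    """Keep edges whose weight is at least min_weight (assumed semantics)."""
    return {pair: w for pair, w in weights.items() if w >= min_weight}


# Toy data (hypothetical): A and B both cite X and Y; C cites only X.
links = [("A", "X"), ("A", "Y"), ("B", "X"), ("B", "Y"), ("C", "X")]
coupling = coupling_weights(links)
cocitation = cocitation_weights(links)
strong_coupling = apply_threshold(coupling, 2)
```

Enumerating all pairs per reference is quadratic in the number of citers, so a sketch like this is only practical for small inputs or with additional pruning.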

Extract giant component

> python extract_giant.py -weight false ../astro/direct.edgelist
> python extract_giant.py -weight true ../astro/coupling.edgelist
> python extract_giant.py -weight true ../astro/cocitation.edgelist
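
The source of extract_giant.py is not reproduced in these notes. As a rough illustration of the step, the sketch below finds the largest connected component of an undirected edge list with a breadth-first search and keeps only the edges inside it. The function names and the toy edge list are hypothetical, and edge weights are ignored.

```python
from collections import defaultdict, deque


def giant_component(edges):
    """Return the node set of the largest connected component."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def filter_edges(edges, nodes):
    """Keep only edges whose endpoints both lie in `nodes`."""
    return [(u, v) for u, v in edges if u in nodes and v in nodes]


# Toy graph (hypothetical): a triangle plus one isolated edge.
edges = [(1, 2), (2, 3), (3, 1), (4, 5)]
giant = giant_component(edges)
giant_edges = filter_edges(edges, giant)
```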

Discover overlapping communities

Direct Citation

> svinet -file astro/direct.edgelist.giant -n 101831 -k 101831 -eta-type fromdata -findk
> wc -l n101831-k101831-mmsb-findk/communities.txt
> svinet -file astro/direct.edgelist.giant -n 101831 -k 2396 -eta-type fromdata -link-sampling
> cd n101831-k2396-mmsb-linksampling
> svinet -file ../astro/direct.edgelist.giant -n 101831 -k 2396 -gml
 
> svinet -file astro/direct.edgelist.giant -n 101831 -k 113 -eta-type fromdata -link-sampling
> cd n101831-k113-mmsb-linksampling
> svinet -file ../astro/direct.edgelist.giant -n 101831 -k 113 -gml

Bibliographic Coupling

> svinet -file astro/coupling.edgelist.giant -n 101053 -k 101053 -eta-type fromdata -findk
> wc -l n101053-k101053-mmsb-findk/communities.txt
> svinet -file astro/coupling.edgelist.giant -n 101053 -k 992 -eta-type fromdata -link-sampling
> cd n101053-k992-mmsb-linksampling
> svinet -file ../astro/coupling.edgelist.giant -n 101053 -k 992 -gml
 
> svinet -file astro/coupling.edgelist.giant -n 101053 -k 113 -eta-type fromdata -link-sampling
> cd n101053-k113-mmsb-linksampling
> svinet -file ../astro/coupling.edgelist.giant -n 101053 -k 113 -gml

Co-citation

> svinet -file astro/cocitation.edgelist.giant -n 82895 -k 82895 -eta-type fromdata -findk
> wc -l n82895-k82895-mmsb-findk/communities.txt
> svinet -file astro/cocitation.edgelist.giant -n 82895 -k 634 -eta-type fromdata -link-sampling
> cd n82895-k634-mmsb-linksampling
> svinet -file ../astro/cocitation.edgelist.giant -n 82895 -k 634 -gml
 
> svinet -file astro/cocitation.edgelist.giant -n 82895 -k 113 -eta-type fromdata -link-sampling
> cd n82895-k113-mmsb-linksampling
> svinet -file ../astro/cocitation.edgelist.giant -n 82895 -k 113 -gml

Extract Terms with C-Value Methods

> java -jar AstroConverter.jar Term -input ../astro/astro-ALP-2003-2010.csv -output astro
> java -Xmx8g -XX:-UseGCOverheadLimit -cp jate-2.0-beta.1-jar-with-dependencies.jar uk.ac.shef.dcs.jate.app.AppCValue -corpusDir astro -c true -pf.mttf 3 -o cvalue-terms.json solr-testbed ACLRDTEC
> python extract_terms.py cvalue-terms.json

Topic Labeling

> python labeler.py hard cvalue-terms.json.terms ../astro/direct.docs.vocab ../astro/n101831-k113-mmsb-linksampling/communities.txt ../astro/direct.micro.txt
> python labeler.py hard cvalue-terms.json.terms ../astro/coupling.docs.vocab ../astro/n101053-k113-mmsb-linksampling/communities.txt ../astro/coupling.micro.txt
> python labeler.py hard cvalue-terms.json.terms ../astro/cocitation.docs.vocab ../astro/n82895-k113-mmsb-linksampling/communities.txt ../astro/cocitation.micro.txt
 
> python labeler.py soft cvalue-terms.json.terms ../astro/direct.docs.vocab ../astro/n101831-k113-mmsb-linksampling/communities.txt ../astro/n101831-k113-mmsb-linksampling/groups.txt
> python labeler.py soft cvalue-terms.json.terms ../astro/coupling.docs.vocab ../astro/n101053-k113-mmsb-linksampling/communities.txt ../astro/n101053-k113-mmsb-linksampling/groups.txt
> python labeler.py soft cvalue-terms.json.terms ../astro/cocitation.docs.vocab ../astro/n82895-k113-mmsb-linksampling/communities.txt ../astro/n82895-k113-mmsb-linksampling/groups.txt

Utilities

  • Gini index:
> index = GiniIndex('../astro/direct.edgelist.giant.degree', '../astro/coupling.edgelist.giant.degree', '../astro/cocitation.edgelist.giant.degree')
  • Statistics about uncovered communities: min, max, avg, median, and standard deviation
> python statistics.py ../astro/n101831-k2396-mmsb-linksampling/communities.txt
> python statistics.py ../astro/n101053-k992-mmsb-linksampling/communities.txt
> python statistics.py ../astro/n82895-k634-mmsb-linksampling/communities.txt
 
> python statistics.py ../astro/n101831-k113-mmsb-linksampling/communities.txt
> python statistics.py ../astro/n101053-k113-mmsb-linksampling/communities.txt
> python statistics.py ../astro/n82895-k113-mmsb-linksampling/communities.txt
  • Distribution of the cluster size
> load cluster_size
> cluster_size_distribution(direct_2396(:, 2), 100, 100)
> cluster_size_distribution(coupling_992(:, 2), 100, 100)
> cluster_size_distribution(cocitation_634(:, 2), 100, 100)
> cluster_size_distribution(direct_113(:, 2), 50, 200)
> cluster_size_distribution(coupling_113(:, 2), 50, 200)
> cluster_size_distribution(cocitation_113(:, 2), 50, 200)
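
The Gini index and the community-size statistics listed above are computed by scripts whose source is not shown here. The sketch below is one plausible way to obtain the same kinds of numbers, assuming that each line of communities.txt lists the node IDs of one community separated by whitespace (consistent with `wc -l` being used to count communities). All function names are hypothetical, and both the population variance and its square root are returned because the notes do not say which one the script reports.

```python
import math


def community_sizes(lines):
    """Number of node IDs on each non-empty line (assumed file layout)."""
    return [len(line.split()) for line in lines if line.strip()]


def size_summary(sizes):
    """Min, max, mean, median, population variance and standard deviation."""
    ordered = sorted(sizes)
    n = len(ordered)
    mean = sum(ordered) / n
    middle = n // 2
    if n % 2:
        median = ordered[middle]
    else:
        median = (ordered[middle - 1] + ordered[middle]) / 2
    variance = sum((x - mean) ** 2 for x in ordered) / n
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "avg": mean,
        "median": median,
        "variance": variance,
        "std": math.sqrt(variance),
    }


def gini(values):
    """Gini index of non-negative values (0 means perfectly even)."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, i = 1..n over sorted x.
    weighted = sum(rank * x for rank, x in enumerate(ordered, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Toy input (hypothetical) standing in for the lines of communities.txt.
lines = ["1 2", "3 4 5 6", "7 8 9 10", "11 12 13 14 15 16 17 18 19 20"]
summary = size_summary(community_sizes(lines))
```

The sketch avoids `import statistics` on purpose: a script named statistics.py in the working directory would shadow the standard-library module of the same name.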

Q-Q (Quantile-Quantile) Plot: Analyze → Descriptive Statistics → Explore → Plots, with “Normality plots with tests” selected.

  • Distribution of the number of memberships of nodes
> python membership.py ../astro/n101831-k113-mmsb-linksampling/communities.txt
> python membership.py ../astro/n101053-k113-mmsb-linksampling/communities.txt
> python membership.py ../astro/n82895-k113-mmsb-linksampling/communities.txt
 
> DistOfMemberships('../astro/n101831-k113-mmsb-linksampling/communities.dist')
> DistOfMemberships('../astro/n101053-k113-mmsb-linksampling/communities.dist')
> DistOfMemberships('../astro/n82895-k113-mmsb-linksampling/communities.dist')
  • Overlaps between thematic structures at a high level
> java -jar AstroConverter.jar Overlap -threshold 150 -input ../astro/n101831-k113-mmsb-linksampling/communities.txt -output ../astro/direct
> java -jar AstroConverter.jar Overlap -threshold 200 -input ../astro/n101053-k113-mmsb-linksampling/communities.txt -output ../astro/coupling
> java -jar AstroConverter.jar Overlap -threshold 200 -input ../astro/n82895-k113-mmsb-linksampling/communities.txt -output ../astro/cocitation
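
To make the two overlap-related utilities above more concrete, here is a small sketch that counts how many communities each node belongs to and how many nodes each pair of communities shares. It assumes that communities are available as lists of node IDs; the names `membership_distribution`, `overlap_sizes`, and `min_shared` are hypothetical, and the exact meaning of the `-threshold` option of AstroConverter is not documented in these notes.

```python
from collections import Counter
from itertools import combinations


def membership_counts(communities):
    """Map each node to the number of communities that contain it."""
    counts = Counter()
    for members in communities:
        counts.update(set(members))  # count a node once per community
    return counts


def membership_distribution(communities):
    """Map k to the number of nodes that belong to exactly k communities."""
    return Counter(membership_counts(communities).values())


def overlap_sizes(communities, min_shared=1):
    """Shared-node counts for community pairs with at least min_shared nodes in common."""
    sets = [set(members) for members in communities]
    overlaps = {}
    for i, j in combinations(range(len(sets)), 2):
        shared = len(sets[i] & sets[j])
        if shared >= min_shared:
            overlaps[(i, j)] = shared
    return overlaps


# Toy communities (hypothetical): node 3 sits in all three groups.
communities = [[1, 2, 3], [3, 4], [3, 4, 5]]
dist = membership_distribution(communities)
overlaps = overlap_sizes(communities, min_shared=2)
```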

Screenshot → Save → EPS (Encapsulated PostScript) files (*.eps)

  • Extract subgraph for visualization

Load the generated file (“subnetwork.gml” in the commands below) in Gephi to visualize it. Partition nodes by their “group” attribute and edges by their “color” attribute, and size each node by its bridgeness. (First, copy bridgeness to a new column with column type BigDecimal.)

> java -jar AstroConverter.jar Network -node 556 -top 4 -input ../astro/n101831-k113-mmsb-linksampling/gml/network.gml -output ../astro/n101831-k113-mmsb-linksampling/gml/subnetwork.gml
> java -jar AstroConverter.jar Network -node 2881 -top 3 -input ../astro/n101053-k113-mmsb-linksampling/gml/network.gml -output ../astro/n101053-k113-mmsb-linksampling/gml/subnetwork.gml
> java -jar AstroConverter.jar Network -node 354 -top 3 -input ../astro/n82895-k113-mmsb-linksampling/gml/network.gml -output ../astro/n82895-k113-mmsb-linksampling/gml/subnetwork.gml

Note that the -node option is set to the first element in the first line of *.edgelist.giant.degree.

  • Convert the solution to csv format
> python convert2csv.py ../astro/direct.docs.vocab ../astro/n101831-k113-mmsb-linksampling/communities.txt ../astro/n101831-k113-mmsb-linksampling/groups.txt
> python convert2csv.py ../astro/coupling.docs.vocab ../astro/n101053-k113-mmsb-linksampling/communities.txt ../astro/n101053-k113-mmsb-linksampling/groups.txt
> python convert2csv.py ../astro/cocitation.docs.vocab ../astro/n82895-k113-mmsb-linksampling/communities.txt ../astro/n82895-k113-mmsb-linksampling/groups.txt
  • Spectral co-clustering
> python coclustering.py coclustering ../astro/direct.csv ../astro/coupling.csv 5
> python coclustering.py coclustering ../astro/direct.csv ../astro/cocitation.csv 5
> python coclustering.py coclustering ../astro/cocitation.csv ../astro/coupling.csv 5
> python coclustering.py coclustering ../astro/hd.csv ../astro/direct.csv 5
> python coclustering.py coclustering ../astro/hd.csv ../astro/coupling.csv 5
> python coclustering.py coclustering ../astro/hd.csv ../astro/cocitation.csv 5
  • Normalized mutual information, adjusted mutual information and modification of partition coefficient
> python measure.py ../astro/direct.csv ../astro/coupling.csv 113
> python measure.py ../astro/direct.csv ../astro/cocitation.csv 113
> python measure.py ../astro/coupling.csv ../astro/cocitation.csv 113
> python measure.py ../astro/direct.csv ../astro/hd.csv 113
> python measure.py ../astro/coupling.csv ../astro/hd.csv 113
> python measure.py ../astro/cocitation.csv ../astro/hd.csv 113
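
As background for the comparison measures named above, the sketch below computes normalized mutual information between two hard label assignments over the same documents. It is not measure.py: the solutions compared in these notes are overlapping, so the real script presumably handles soft or multiple memberships, and the choice of normalizing by the arithmetic mean of the two entropies is an assumption.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label sequences of equal length."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a = Counter(a)
    count_b = Counter(b)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts.
        mi += (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
    return mi


def nmi(a, b):
    """NMI normalized by the arithmetic mean of the two entropies."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both partitions are a single cluster
    return 2.0 * mutual_information(a, b) / (h_a + h_b)


# Toy labelings (hypothetical): identical up to renaming, then independent.
same = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Relabeling clusters does not change the score, which is why the first toy pair scores as a perfect match.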

zh/notes/topic_extraction_mmsb.txt · Last modified: 2022/06/30 11:32 by pzczxs