
Overlapping Thematic Structures Extraction with Mixed-Membership Stochastic Blockmodel

data source: Topic Extraction Challenge

astro: data directory

tools: tool directory

doc: documents

Tools

> java -jar AstroConverter.jar DirectCitation -input ../astro/direct_citations.txt -output ../astro/direct
> java -jar AstroConverter.jar BibliographicCoupling -threshold 4 -weight true -input ../astro/citation_links.txt -output ../astro/coupling
> java -jar AstroConverter.jar Cocitation -threshold 1 -weight true -input ../astro/citation_links.txt -output ../astro/cocitation
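The three conversions above turn raw citation links into a direct-citation, a bibliographic-coupling, and a co-citation network. AstroConverter.jar itself is not documented here, so the following is only a minimal sketch of the standard definitions: two papers are coupled once for every reference they share, and two papers are co-cited once for every paper that cites both. The function names, the (citing, cited) input format, and reading the thresholds as "keep weights greater than or equal to the threshold" are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def pair_counts(groups):
    """Count how often each unordered pair of items appears in the same group."""
    counts = defaultdict(int)
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            counts[(a, b)] += 1
    return counts


def coupling_and_cocitation(links, coupling_min=4, cocitation_min=1):
    """links: iterable of (citing, cited) pairs (assumed input format)."""
    refs_of = defaultdict(set)    # citing paper -> papers it cites
    citers_of = defaultdict(set)  # cited paper  -> papers citing it
    for citing, cited in links:
        refs_of[citing].add(cited)
        citers_of[cited].add(citing)

    # Each shared reference adds 1 to the coupling weight of every pair of its citers.
    coupling = {p: w for p, w in pair_counts(citers_of).items() if w >= coupling_min}
    # Each citing paper adds 1 to the co-citation weight of every pair it cites.
    cocitation = {p: w for p, w in pair_counts(refs_of).items() if w >= cocitation_min}
    return coupling, cocitation
```

For example, with links A→X, A→Y, B→X, B→Y, C→X and coupling_min=2, only the pair (A, B) survives with weight 2, and (X, Y) is co-cited twice.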

Extract giant component

> python extract_giant.py -weight false ../astro/direct.edgelist
> python extract_giant.py -weight true ../astro/coupling.edgelist
> python extract_giant.py -weight true ../astro/cocitation.edgelist
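extract_giant.py is not listed on this page. As a rough illustration of what extracting the giant component involves, the sketch below finds the largest connected component of an undirected edge list with a breadth-first search and keeps only the edges inside it. The function names and the in-memory edge format are hypothetical; the real script also reads and writes files and handles the -weight option.

```python
from collections import defaultdict, deque


def giant_component(edges):
    """Return the node set of the largest connected component (edges treated as undirected)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        comp = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    comp.add(nb)
                    queue.append(nb)
        if len(comp) > len(best):
            best = comp
    return best


def filter_edges(edges, keep):
    """Keep only the edges whose two endpoints are both in `keep`."""
    return [(u, v) for u, v in edges if u in keep and v in keep]
```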

Discover overlapping communities

Direct Citation

> svinet -file astro/direct.edgelist.giant -n 101831 -k 101831 -eta-type fromdata -findk
> wc -l n101831-k101831-mmsb-findk/communities.txt
> svinet -file astro/direct.edgelist.giant -n 101831 -k 2396 -eta-type fromdata -link-sampling
> cd n101831-k2396-mmsb-linksampling
> svinet -file ../astro/direct.edgelist.giant -n 101831 -k 2396 -gml
 
> svinet -file astro/direct.edgelist.giant -n 101831 -k 113 -eta-type fromdata -link-sampling
> cd n101831-k113-mmsb-linksampling
> svinet -file ../astro/direct.edgelist.giant -n 101831 -k 113 -gml

Bibliographic Coupling

> svinet -file astro/coupling.edgelist.giant -n 101053 -k 101053 -eta-type fromdata -findk
> wc -l n101053-k101053-mmsb-findk/communities.txt
> svinet -file astro/coupling.edgelist.giant -n 101053 -k 992 -eta-type fromdata -link-sampling
> cd n101053-k992-mmsb-linksampling
> svinet -file ../astro/coupling.edgelist.giant -n 101053 -k 992 -gml
 
> svinet -file astro/coupling.edgelist.giant -n 101053 -k 113 -eta-type fromdata -link-sampling
> cd n101053-k113-mmsb-linksampling
> svinet -file ../astro/coupling.edgelist.giant -n 101053 -k 113 -gml

Co-citation

> svinet -file astro/cocitation.edgelist.giant -n 82895 -k 82895 -eta-type fromdata -findk
> wc -l n82895-k82895-mmsb-findk/communities.txt
> svinet -file astro/cocitation.edgelist.giant -n 82895 -k 634 -eta-type fromdata -link-sampling
> cd n82895-k634-mmsb-linksampling
> svinet -file ../astro/cocitation.edgelist.giant -n 82895 -k 634 -gml
 
> svinet -file astro/cocitation.edgelist.giant -n 82895 -k 113 -eta-type fromdata -link-sampling
> cd n82895-k113-mmsb-linksampling
> svinet -file ../astro/cocitation.edgelist.giant -n 82895 -k 113 -gml
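The wc -l calls above count the lines of communities.txt, which suggests one community per line. Assuming each line is a whitespace-separated list of node IDs (an assumption about the file layout, not something stated on this page), the sketch below parses such lines and derives the two summaries used later in these notes: community-size statistics and the distribution of the number of memberships per node. It is an illustration only, not the statistics.py or membership.py scripts.

```python
from collections import Counter
from statistics import mean, median


def parse_communities(lines):
    """One community per non-empty line, node IDs separated by whitespace (assumed format)."""
    return [line.split() for line in lines if line.strip()]


def size_summary(communities):
    """Number of communities plus min, max, mean, and median community size."""
    sizes = [len(c) for c in communities]
    return {
        "count": len(sizes),
        "min": min(sizes),
        "max": max(sizes),
        "mean": mean(sizes),
        "median": median(sizes),
    }


def membership_distribution(communities):
    """Map 'number of communities a node belongs to' -> 'number of such nodes'."""
    per_node = Counter(node for c in communities for node in set(c))
    return Counter(per_node.values())
```

To apply it to a real run, pass an open file object, for example parse_communities(open(path)) with path pointing at one of the communities.txt files.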

Extract Terms with the C-Value Method

> java -jar AstroConverter.jar Term -input ../astro/astro-ALP-2003-2010.csv -output astro
> java -Xmx8g -XX:-UseGCOverheadLimit -cp jate-2.0-beta.1-jar-with-dependencies.jar uk.ac.shef.dcs.jate.app.AppCValue -corpusDir astro -c true -pf.mttf 3 -o cvalue-terms.json solr-testbed ACLRDTEC
> python extract_terms.py cvalue-terms.json
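For orientation, the basic C-value score of a candidate term a with frequency f(a) is log2|a| · f(a) when a is not nested in any longer candidate, and log2|a| · (f(a) − mean frequency of the longer candidates containing a) otherwise. The sketch below implements only this textbook formula on a toy frequency table. It is not JATE's code, which may use a different variant (for example in how single-word terms or nested counts are handled); the function names and the tuple-of-tokens representation are assumptions.

```python
import math


def is_nested(short, long):
    """True if `short` occurs as a contiguous sub-sequence of the strictly longer `long`."""
    n, m = len(short), len(long)
    return n < m and any(long[i:i + n] == short for i in range(m - n + 1))


def c_values(freq):
    """freq maps a term (tuple of tokens) to its corpus frequency."""
    scores = {}
    for term, f in freq.items():
        # Frequencies of all longer candidate terms that contain this term.
        containers = [f_b for other, f_b in freq.items() if is_nested(term, other)]
        if containers:
            f = f - sum(containers) / len(containers)
        # log2 of the term length in tokens; single-word terms therefore score 0 here.
        scores[term] = math.log2(len(term)) * f
    return scores
```

With frequencies 5 for ("basal", "cell", "carcinoma") and 8 for ("cell", "carcinoma"), the nested two-word term is discounted to log2(2) · (8 − 5), while the three-word term keeps log2(3) · 5.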

Topic Labeling

> python labeler.py hard cvalue-terms.json.terms ../astro/direct.docs.vocab ../astro/n101831-k113-mmsb-linksampling/communities.txt ../astro/direct.micro.txt
> python labeler.py hard cvalue-terms.json.terms ../astro/coupling.docs.vocab ../astro/n101053-k113-mmsb-linksampling/communities.txt ../astro/coupling.micro.txt
> python labeler.py hard cvalue-terms.json.terms ../astro/cocitation.docs.vocab ../astro/n82895-k113-mmsb-linksampling/communities.txt ../astro/cocitation.micro.txt
 
> python labeler.py soft cvalue-terms.json.terms ../astro/direct.docs.vocab ../astro/n101831-k113-mmsb-linksampling/communities.txt ../astro/n101831-k113-mmsb-linksampling/groups.txt
> python labeler.py soft cvalue-terms.json.terms ../astro/coupling.docs.vocab ../astro/n101053-k113-mmsb-linksampling/communities.txt ../astro/n101053-k113-mmsb-linksampling/groups.txt
> python labeler.py soft cvalue-terms.json.terms ../astro/cocitation.docs.vocab ../astro/n82895-k113-mmsb-linksampling/communities.txt ../astro/n82895-k113-mmsb-linksampling/groups.txt

Utilities

  • Gini index:
> index = GiniIndex('../astro/direct.edgelist.giant.degree', '../astro/coupling.edgelist.giant.degree', '../astro/cocitation.edgelist.giant.degree')
  • Statistics of the discovered communities: min, max, average, and median
> python statistics.py ../astro/n101831-k2396-mmsb-linksampling/communities.txt
> python statistics.py ../astro/n101053-k992-mmsb-linksampling/communities.txt
> python statistics.py ../astro/n82895-k634-mmsb-linksampling/communities.txt
 
> python statistics.py ../astro/n101831-k113-mmsb-linksampling/communities.txt
> python statistics.py ../astro/n101053-k113-mmsb-linksampling/communities.txt
> python statistics.py ../astro/n82895-k113-mmsb-linksampling/communities.txt
  • Distribution of the number of memberships of nodes
> python membership.py ../astro/n101831-k113-mmsb-linksampling/communities.txt
> python membership.py ../astro/n101053-k113-mmsb-linksampling/communities.txt
> python membership.py ../astro/n82895-k113-mmsb-linksampling/communities.txt
 
> DistOfMemberships('../astro/n101831-k113-mmsb-linksampling/communities.dist')
> DistOfMemberships('../astro/n101053-k113-mmsb-linksampling/communities.dist')
> DistOfMemberships('../astro/n82895-k113-mmsb-linksampling/communities.dist')
  • Overlaps between thematic structures at a high level

Screenshot → Save → EPS (Encapsulated PostScript) files (*.eps) → PDF →

  • Extract a subgraph for visualization

Load the file “subnetwork.gml” in Gephi to visualize it. Partition the nodes by their “group” attribute and the edges by their “color” attribute. Size each node by its bridgeness. (First, copy bridgeness to a new column of type BigDecimal.)

> java -jar AstroConverter.jar Network -node 556 -top 4 -input ../astro/n101831-k113-mmsb-linksampling/gml/network.gml -output ../astro/n101831-k113-mmsb-linksampling/gml/subnetwork.gml
> java -jar AstroConverter.jar Network -node 2881 -top 3 -input ../astro/n101053-k113-mmsb-linksampling/gml/network.gml -output ../astro/n101053-k113-mmsb-linksampling/gml/subnetwork.gml
> java -jar AstroConverter.jar Network -node 354 -top 3 -input ../astro/n82895-k113-mmsb-linksampling/gml/network.gml -output ../astro/n82895-k113-mmsb-linksampling/gml/subnetwork.gml

Note that the -node option is set to the first element on the first line of the corresponding *.edgelist.giant.degree file.

  • Convert the solution to csv format
> python convert2csv.py ../astro/direct.docs.vocab ../astro/n101831-k113-mmsb-linksampling/communities.txt ../astro/n101831-k113-mmsb-linksampling/groups.txt
> python convert2csv.py ../astro/coupling.docs.vocab ../astro/n101053-k113-mmsb-linksampling/communities.txt ../astro/n101053-k113-mmsb-linksampling/groups.txt
> python convert2csv.py ../astro/cocitation.docs.vocab ../astro/n82895-k113-mmsb-linksampling/communities.txt ../astro/n82895-k113-mmsb-linksampling/groups.txt
  • Spectral co-clustering
> python coclustering.py coclustering ../astro/direct.csv ../astro/coupling.csv 5
> python coclustering.py coclustering ../astro/direct.csv ../astro/cocitation.csv 5
> python coclustering.py coclustering ../astro/cocitation.csv ../astro/coupling.csv 5
> python coclustering.py coclustering ../astro/hd.csv ../astro/direct.csv 5
> python coclustering.py coclustering ../astro/hd.csv ../astro/coupling.csv 5
> python coclustering.py coclustering ../astro/hd.csv ../astro/cocitation.csv 5
  • Normalized mutual information (NMI), adjusted mutual information (AMI), and modified partition coefficient
> python measure.py ../astro/direct.csv ../astro/coupling.csv 113
> python measure.py ../astro/direct.csv ../astro/cocitation.csv 113
> python measure.py ../astro/coupling.csv ../astro/cocitation.csv 113
> python measure.py ../astro/direct.csv ../astro/hd.csv 113
> python measure.py ../astro/coupling.csv ../astro/hd.csv 113
> python measure.py ../astro/cocitation.csv ../astro/hd.csv 113
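measure.py is not shown on this page. Assuming the partition-coefficient measure meant above is Dave's modified partition coefficient for soft memberships, it can be sketched as follows: with an N × K membership matrix U whose rows sum to 1, PC = (1/N) Σ_i Σ_k u_ik², and MPC = 1 − K/(K − 1) · (1 − PC), so that MPC is 1 for a crisp partition and 0 for completely uniform memberships. The function names and the list-of-rows input are hypothetical.

```python
def partition_coefficient(memberships):
    """memberships: list of rows, one per node, each row summing to 1 over K communities."""
    n = len(memberships)
    return sum(u * u for row in memberships for u in row) / n


def modified_partition_coefficient(memberships):
    """Rescale PC from the range [1/K, 1] to [0, 1]; requires K >= 2."""
    k = len(memberships[0])
    pc = partition_coefficient(memberships)
    return 1.0 - k / (k - 1) * (1.0 - pc)
```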
  1. Log-likelihood and Bayesian Information Criterion (BIC): likelihood.py astro.adjlist.giant n101831-k50-mmsb-linksampling
  2. Comparing results in terms of adjusted mutual information and normalized mutual information
    1. SolutionComparisonConverter.java
    2. mutual_info.py cluster_pair_file
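As a reference point for the comparison step, the sketch below computes normalized mutual information between two hard labelings of the same nodes, normalizing by the arithmetic mean of the two entropies. This is one common convention and an assumption here; mutual_info.py and measure.py may normalize differently and may treat overlapping memberships in their own way. Adjusted mutual information, which additionally subtracts the expected mutual information under random labelings, is not covered.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label sequences of equal length."""
    n = len(a)
    ca, cb, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        c / n * math.log(c * n / (ca[x] * cb[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI normalized by the arithmetic mean of the two entropies."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0.0 and hb == 0.0:
        return 1.0  # both labelings put every node in a single cluster
    return mutual_information(a, b) / ((ha + hb) / 2)
```

Two labelings that differ only in label names, such as [0, 0, 1, 1] and [1, 1, 0, 0], give an NMI of 1, while statistically independent labelings such as [0, 0, 1, 1] and [0, 1, 0, 1] give 0.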
zh/notes/topic_extraction_mmsb.1526098518.txt.gz · Last modified: 2018/05/12 12:15 by pzczxs