Hierarchical clustering of bipartite data sets based on the statistical significance of coincidences

Ignacio Tamarit, María Pereda, and José A. Cuesta
Phys. Rev. E 102, 042304 – Published 6 October 2020

Abstract

When some ‘entities’ are related by the ‘features’ they share, they are amenable to a bipartite network representation. Plant-pollinator ecological communities, co-authorship of scientific papers, customers and purchases, or answers in a poll are but a few examples. Analyzing the clustering of such entities in the network is a useful tool with applications in many fields, like internet technology, recommender systems, or detection of diseases. The algorithms most widely applied to find clusters in bipartite networks are variants of modularity optimization. Here, we provide a hierarchical clustering algorithm based on a dissimilarity between entities that quantifies the probability that the features shared by two entities are due to mere chance. The algorithm performance is O(n^2) when applied to a set of n entities, and its outcome is a dendrogram exhibiting the connections of those entities. Through the introduction of a ‘susceptibility’ measure we can provide an ‘optimal’ choice for the clustering as well as quantify its quality. The dendrogram reveals further useful structural information, though, like the existence of subclusters within clusters or of nodes that do not fit in any cluster. We illustrate the algorithm by applying it first to a set of synthetic networks, and then to a selection of examples. We also illustrate how to transform our algorithm into a valid alternative for one-mode networks as well, and show that it performs at least as well as the standard, modularity-based algorithms, with a higher numerical performance. We provide an implementation of the algorithm in Python, freely accessible from GitHub.
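The core idea of the abstract, a dissimilarity equal to the probability that two entities share their features by chance, fed into a hierarchical clustering, can be sketched as follows. This is an illustration only, not the authors' implementation. It assumes a hypergeometric null model for the coincidences and a naive average-linkage agglomeration, neither of which is stated in the abstract, and the naive loop does not reach the O(n^2) cost quoted above. The names `coincidence_pvalue` and `agglomerate` and the toy data are hypothetical.

```python
from math import comb


def coincidence_pvalue(a, b, n_features):
    """Probability of sharing at least len(a & b) features by chance.

    Assumption: hypergeometric null model (b's features drawn uniformly
    at random, without replacement, from n_features possible features).
    """
    ka, kb, shared = len(a), len(b), len(a & b)
    total = comb(n_features, kb)
    tail = sum(
        comb(ka, s) * comb(n_features - ka, kb - s)
        for s in range(shared, min(ka, kb) + 1)
    )
    return tail / total


def agglomerate(entities, n_features):
    """Naive average-linkage agglomeration on the p-value dissimilarity.

    Returns the dendrogram as a list of merges:
    (members_of_cluster_a, members_of_cluster_b, merge_height).
    """
    names = list(entities)
    # Pairwise dissimilarities: small p-value = significant overlap = close.
    d = {
        frozenset((u, v)): coincidence_pvalue(entities[u], entities[v], n_features)
        for i, u in enumerate(names)
        for v in names[i + 1:]
    }
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [d[frozenset((u, v))]
                         for u in clusters[i] for v in clusters[j]]
                height = sum(pairs) / len(pairs)
                if best is None or height < best[0]:
                    best = (height, i, j)
        height, i, j = best
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Toy bipartite data: four entities, 20 possible features (hypothetical).
entities = {
    "A": {0, 1, 2, 3, 4},
    "B": {0, 1, 2, 3, 5},
    "C": {10, 11, 12, 13, 14},
    "D": {10, 11, 12, 13, 15},
}
merges = agglomerate(entities, n_features=20)
for left, right, height in merges:
    print(left, right, round(height, 4))
```

In this toy example A and B (and likewise C and D) share four of their five features, which is very unlikely by chance, so they merge at a low height; the two resulting pairs share nothing and only join at the top of the dendrogram.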

  • Received 30 April 2020
  • Accepted 13 September 2020

DOI: https://doi.org/10.1103/PhysRevE.102.042304

©2020 American Physical Society

Physics Subject Headings (PhySH)

  • Networks
  • Interdisciplinary Physics
  • Statistical Physics & Thermodynamics

Authors & Affiliations

Ignacio Tamarit [1,2], María Pereda [1,2,3], and José A. Cuesta [1,2,4,5,*]

  • [1] Grupo Interdisciplinar de Sistemas Complejos (GISC), Departamento de Matemáticas de la Universidad Carlos III de Madrid, Leganés, Spain
  • [2] Unidad Mixta Interdisciplinar de Comportamiento y Complejidad Social (UMICCS), Madrid, Spain
  • [3] Grupo de Investigación Ingeniería de Organización y Logística (IOL), Escuela Técnica Superior de Ingenieros Industriales, Universidad Politécnica de Madrid, Madrid, Spain
  • [4] Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), Universidad de Zaragoza, Zaragoza, Spain
  • [5] UC3M-Santander Big Data Institute (IBiDat), Getafe, Spain

Issue

Vol. 102, Iss. 4 — October 2020
