Fast and Scalable Method for Distributed Boolean Tensor Factorization

Namyong Park, Sejoon Oh, and U Kang

Overview

How can we analyze tensors that are composed of 0’s and 1’s? How can we efficiently analyze such Boolean tensors with millions or even billions of entries? Boolean tensors often represent relationship, membership, or occurrences of events such as subject-relation-object tuples in knowledge base data (e.g., ‘Seoul’-‘is the capital of’-‘South Korea’). Boolean tensor factorization (BTF) is a useful tool for analyzing binary tensors to discover latent factors from them. Furthermore, BTF is known to produce more interpretable and sparser results than normal factorization methods. Although several BTF algorithms exist, they do not scale up for large-scale Boolean tensors.
In this paper, we propose DBTF, a distributed method for Boolean CP (DBTF-CP) and Tucker (DBTF-TK) factorizations running on the Apache Spark framework. By distributed data generation with minimal network transfer, exploiting the characteristics of Boolean operations, and with careful partitioning, DBTF successfully tackles the high computational costs and minimizes the intermediate data.

Paper

Our proposed method DBTF is described in the following papers.

Code

  • DBTF v2.0 includes DBTF-CP and DBTF-TK presented in our VLDB journal paper. The source code is available on our [GitHub Repository]
  • DBTF v1.0 includes DBTF-CP presented in our ICDE paper. The binary code is avaiable here: [Download]

Datasets

Name Dimensionality Non-zeros Description Source Download
Facebook 64K × 64K × 870 1.5M Temporal relationship data between users
(User-User-Date)
Max Planck Institute
for Software Systems
Download
DBLP 418K × 3.5K × 50 1.3M DBLP publication data
(Author-Conference-Year)
DBLP Download
CAIDA-DDoS-S 9.3K × 9.3K × 3.9K 22.8M Network DDoS attack traffic (small)
(Source IP-Destination IP-Time)
Center for Applied
Internet Data Analysis
Download
CAIDA-DDoS-L 9.3K × 9.3K × 393K 331.8M Network DDoS attack traffic (large)
(Source IP-Destination IP-Time)
Center for Applied
Internet Data Analysis
Download
NELL-S 14.5K × 14.5K × 28.8K 76.9M Knowledge base data (small)
(Subject-Object-Predicate)
NELL Web
NELL-L 112K × 112K × 213K 17.9M Knowledge base data (large)
(Subject-Object-Predicate)
NELL Web
Synthetic-scalability 26~213 × 26~213 × 26~213 2.6K~5.5B Synthetic tensors
Synthetic-CP-error 100 × 100 × 100 6.5K~240K Synthetic tensors
Synthetic-TK-error 100 × 100 × 100 1.5K~45K Synthetic tensors

People