Scientific Track

Generalization in Unsupervised Learning

We are interested in the following questions. Given a finite data set S, with neither labels nor side information, and an unsupervised learning algorithm A, can the generalization of A be assessed on S? Similarly, given two unsupervised learning algorithms, A1 and A2, for the same learning task, can one assess whether one will generalize "better" on future data drawn from the same source as S? In this paper, we develop a general approach to answering these questions in a reliable and efficient manner using mild assumptions on A.
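
The abstract does not spell out the assessment procedure. As a rough illustration of the question being asked, the minimal hold-out sketch below (not the paper's approach) compares unsupervised learners by the loss they incur on data they were not fit on, here using k-means quantization error; the split ratio and the choices of k are arbitrary assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    def holdout_score(X, n_clusters=5, seed=0):
        """Fit k-means on half of X; return mean squared quantization error on the other half."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        train, test = X[idx[: len(X) // 2]], X[idx[len(X) // 2:]]
        model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(train)
        # Distance of each held-out point to its nearest learned centroid.
        nearest = model.transform(test).min(axis=1)
        return float(np.mean(nearest ** 2))

    # Compare two candidate algorithms (here simply two settings of k) on the same source.
    X = np.random.default_rng(1).normal(size=(1000, 10))
    print(holdout_score(X, n_clusters=5), holdout_score(X, n_clusters=20))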

Gamma Process Poisson Factorization for Joint Modeling of Network and Documents

Developing models to discover, analyze, and predict clusters within networked entities is an area of active and diverse research. However, many of the existing approaches do not take into consideration pertinent auxiliary information. This paper introduces Joint Gamma Process Poisson Factorization (J-GPPF) to jointly model a network and its side information. J-GPPF naturally fits sparse networks, accommodates separately clustered side information in a principled way, and effectively addresses the computational challenges of analyzing large networks.
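
For orientation only: the maximum-likelihood core underlying gamma-Poisson factorization models is non-negative factorization of a count matrix under a Poisson (generalized KL) likelihood. The sketch below factorizes a toy count matrix with scikit-learn's multiplicative-update NMF; it omits everything that makes J-GPPF distinctive (gamma process priors, the joint network/document term, and scalable inference).

    import numpy as np
    from sklearn.decomposition import NMF

    # Toy sparse count matrix standing in for an adjacency or document-word matrix.
    counts = np.random.default_rng(0).poisson(0.3, size=(200, 300)).astype(float)

    # Multiplicative-update NMF under generalized KL divergence: the maximum-likelihood
    # counterpart of Poisson factorization (no gamma priors, no joint network term).
    model = NMF(n_components=10, beta_loss="kullback-leibler", solver="mu",
                max_iter=500, random_state=0)
    theta = model.fit_transform(counts)   # row factors, e.g. node/document loadings
    beta = model.components_              # column factors, e.g. community/topic rates
    rates = theta @ beta                  # reconstructed Poisson rates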

Discovering Opinion Spammer Groups by Network Footprints

Online reviews are an important source for consumers to evaluate products/services on the Internet (e.g. Amazon, Yelp, etc.). However, more and more fraudulent reviewers write fake reviews to mislead users. To maximize their impact and share effort, many spam attacks are organized as campaigns carried out by groups of spammers. In this paper, we propose a new two-step method to discover spammer groups and their targeted products. First, we introduce NFS (Network Footprint Score), a new measure that quantifies the likelihood of products being spam campaign targets.

ConDist: A Context-Driven Categorical Distance Measure

A distance measure between objects is a key requirement for many data mining tasks like clustering, classification or outlier detection.
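
As a point of reference only (not ConDist itself), the simplest categorical distance is the overlap measure that counts mismatching attribute values; context-driven measures such as ConDist aim to improve on exactly this kind of baseline.

    import numpy as np

    def overlap_distance(x, y):
        """Baseline categorical distance: fraction of attributes on which x and y disagree."""
        x, y = np.asarray(x, dtype=object), np.asarray(y, dtype=object)
        return float(np.mean(x != y))

    print(overlap_distance(["red", "small", "round"], ["red", "large", "round"]))  # ~0.33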

Bayesian Active Clustering with Pairwise Constraints

Clustering can be improved with pairwise constraints that specify similarities between pairs of instances. However, randomly selected constraints can waste labeling effort or even degrade the clustering performance. Consequently, how to actively select effective pairwise constraints to improve clustering becomes an important problem, and it is the focus of this paper. In this work, we introduce a Bayesian clustering model that learns from pairwise constraints.
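
The Bayesian model itself is not described in the abstract; the sketch below only illustrates the generic idea of active constraint selection: estimate soft cluster assignments, then query the pair whose same-cluster probability is most uncertain. The Gaussian mixture and the candidate subsampling are illustrative assumptions, not the paper's choices.

    import numpy as np
    from itertools import combinations
    from sklearn.mixture import GaussianMixture

    def most_uncertain_pair(X, n_components=3, n_candidates=200, seed=0):
        """Return the candidate pair whose same-cluster probability is closest to 0.5."""
        resp = GaussianMixture(n_components=n_components, random_state=seed).fit(X).predict_proba(X)
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=min(n_candidates, len(X)), replace=False)
        best_pair, best_gap = None, np.inf
        for i, j in combinations(idx, 2):
            p_same = float(resp[i] @ resp[j])   # probability that i and j share a component
            if abs(p_same - 0.5) < best_gap:
                best_pair, best_gap = (int(i), int(j)), abs(p_same - 0.5)
        return best_pair  # query an oracle for a must-link / cannot-link label on this pair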

A kernel-learning approach to semi-supervised clustering with relative distance comparisons

We consider the problem of clustering a given dataset into k clusters subject to an additional set of constraints on relative distance comparisons between the data items. The additional constraints are meant to reflect side-information that is not directly expressed in the feature vectors. Relative comparisons can express structures at a finer level of detail than the must-link (ML) and cannot-link (CL) constraints that are commonly used for semi-supervised clustering. Relative comparisons are particularly useful in settings where giving an ML or a CL constraint is difficult because the granularity of
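
A hedged sketch of how relative comparisons can drive learning: instead of the paper's kernel-learning formulation, the toy below learns a linear (Mahalanobis-style) transform L by applying a hinge penalty whenever a triplet (i, j, k), read as "i should be closer to j than to k", is violated. The margin, learning rate, and epoch count are arbitrary assumptions.

    import numpy as np

    def learn_metric(X, triplets, margin=1.0, lr=0.01, epochs=50, seed=0):
        """Learn L so that d(x, y) = ||L (x - y)||^2 respects triplets (i, j, k):
        item i should end up closer to item j than to item k."""
        rng = np.random.default_rng(seed)
        L = np.eye(X.shape[1])
        triplets = list(triplets)
        for _ in range(epochs):
            rng.shuffle(triplets)
            for i, j, k in triplets:
                a, b = X[i] - X[j], X[i] - X[k]
                # Hinge is active when the comparison is not satisfied by at least the margin.
                if margin + np.sum((L @ a) ** 2) - np.sum((L @ b) ** 2) > 0:
                    L -= lr * 2.0 * (L @ (np.outer(a, a) - np.outer(b, b)))
        return L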

Why is undersampling effective in unbalanced classification tasks?

A well-known rule of thumb in unbalanced classification recommends the rebalancing (typically by resampling) of the classes before proceeding with the learning of the classifier. Though this seems to work for the majority of cases, no detailed analysis exists about the impact of undersampling on the accuracy of the final classifier.
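
For readers unfamiliar with the rule of thumb being analyzed: undersampling simply discards majority-class examples until the classes are balanced, as in the sketch below (the paper analyzes why this helps; it does not propose this routine). Libraries such as imbalanced-learn provide equivalent functionality.

    import numpy as np

    def random_undersample(X, y, seed=0):
        """Drop examples from the larger classes until every class matches the minority size."""
        rng = np.random.default_rng(seed)
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        keep = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
            for c in classes
        ])
        keep = rng.permutation(keep)
        return X[keep], y[keep]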

Versatile Decision Trees for Learning over Multiple Contexts

Discriminative models for classification assume that training and deployment data are drawn from the same distribution. The performance of these models can vary significantly when they are learned and deployed in different contexts with different data distributions. In the literature, this phenomenon is called dataset shift. In this paper, we address several important issues in the dataset shift problem. First, how can we automatically detect that there is a significant difference between training and deployment data to take action or adjust the model appropriately?
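
One common way to detect such a difference (offered here as an illustration, not as the paper's method) is a classifier two-sample test: train a "domain classifier" to separate training from deployment instances; a cross-validated AUC near 0.5 suggests no detectable covariate shift, while values well above 0.5 indicate shift. The random forest and 5-fold CV below are arbitrary choices.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def shift_score(X_train, X_deploy, seed=0):
        """Cross-validated AUC of a domain classifier; about 0.5 means no detectable shift."""
        X = np.vstack([X_train, X_deploy])
        domain = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_deploy))])
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        return float(cross_val_score(clf, X, domain, cv=5, scoring="roc_auc").mean())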

Structured Regularizer for Neural Higher-Order Sequence Models

We introduce both joint training of neural higher-order linear-chain conditional random fields (NHO-LC-CRFs) and a new structured regularizer for sequence modelling. We show that this regularizer can be derived as a lower bound from a mixture of models sharing parts, e.g. neural sub-networks, and relate it to ensemble learning. Furthermore, it can be expressed explicitly as a regularization term in the training objective. We exemplify its effectiveness by exploring the introduced NHO-LC-CRFs for sequence labeling.

Solving Prediction Games with Parallel Batch Gradient Descent

Learning problems in which an adversary can perturb instances at application time can be modeled as games with data-dependent cost functions. In an equilibrium point, the learner's model parameters are the optimal reaction to the data generator's perturbation, and vice versa. Finding an equilibrium point requires the solution of a difficult optimization problem for which both the learner's model parameters and the possible perturbations are free parameters.
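
To make the setting concrete, the toy below runs simultaneous gradient descent/ascent on a small prediction game: the learner updates w to decrease a regularized logistic loss while the adversary updates a perturbation matrix delta to increase it, subject to a quadratic perturbation cost. This is only a sketch of the game structure, not the paper's parallel batch algorithm, and the step size, regularization, and cost weight are assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def find_equilibrium(X, y, lam=0.1, cost=1.0, lr=0.05, steps=500):
        """Simultaneous gradient descent/ascent on a logistic prediction game.

        The learner minimizes a regularized logistic loss in w; the adversary
        perturbs the instances (X + delta) to maximize that loss minus a
        quadratic perturbation cost. Labels y are in {-1, +1}.
        """
        n, d = X.shape
        w = np.zeros(d)
        delta = np.zeros_like(X)
        for _ in range(steps):
            Xp = X + delta
            s = sigmoid(-y * (Xp @ w))                 # per-instance "activity" of the loss
            grad_w = -(s * y) @ Xp / n + lam * w       # learner's gradient (descent)
            grad_delta = -np.outer(s * y, w) / n - 2.0 * cost * delta / n
            w -= lr * grad_w                           # learner step
            delta += lr * grad_delta                   # adversary step (ascent)
        return w, delta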