Accepted Papers

Scientific Track
Authors:
Christian Knoll, Graz University of Technology
Michael Rath, Graz University of Technology
Sebastian Tschiatschek, ETH Zurich
Franz Pernkopf, Graz University of Technology

Abstract:
Approximate inference in large and densely connected graphical models is a challenging but highly relevant problem. Belief propagation, as a method for performing approximate inference in loopy graphs, has shown empirical success in many applications. However, convergence of belief propagation can only be guaranteed for simple graphs. Whether belief propagation converges depends strongly on the applied message update scheme, and specialized schemes can be highly beneficial. Yet, residual belief propagation is the only established method utilizing this fact to improve convergence properties. In experiments, we observe that residual belief propagation fails to converge if local oscillations occur and the same sequence of messages is repeatedly updated. To overcome this issue, we propose two novel message update schemes. In the first scheme we add noise to oscillating messages. In the second scheme we apply weight decay to gradually reduce the influence of these messages and consequently enforce convergence. Furthermore, in contrast to previous work, we consider the correctness of the obtained marginals and observe significant performance improvements when applying the proposed message update schemes to various Ising models with binary random variables.
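To make the two schemes concrete, here is a minimal Python sketch of a single message update with oscillation handling. The oscillation test, parameter names, and noise/decay settings are illustrative assumptions, not the authors' implementation; in particular the decay variant simply damps the update toward the old message.

```python
import numpy as np

def update_message(old_msg, new_msg, history, rng, mode="noise",
                   noise_scale=0.05, decay=0.9, tol=1e-6):
    """One BP message update with the two proposed oscillation fixes.

    A message is treated as oscillating if its new value matches the value
    from two updates ago while differing from the last one (a period-2
    cycle); this simple test is an illustrative assumption.
    """
    oscillating = (
        len(history) >= 2
        and np.allclose(new_msg, history[-2], atol=tol)
        and not np.allclose(new_msg, history[-1], atol=tol)
    )
    if oscillating and mode == "noise":
        # Scheme 1: perturb the oscillating message with small random noise.
        new_msg = new_msg + noise_scale * rng.standard_normal(new_msg.shape)
    elif oscillating and mode == "decay":
        # Scheme 2: damp the update, gradually reducing the message's influence.
        new_msg = decay * old_msg + (1.0 - decay) * new_msg
    new_msg = np.clip(new_msg, 1e-12, None)   # keep the message positive
    return new_msg / new_msg.sum()            # renormalize
```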

Scientific Track
Authors:
Niloofar Yousefi, University of Central Florida
Michael Georgiopoulos, University of Central Florida
Georgios Anagnostopoulos, Florida Institute of Technology

Abstract:
When faced with learning a set of inter-related tasks from a limited amount of usable data, learning each task independently may lead to poor generalization performance. Multi-Task Learning (MTL) exploits the latent relations between tasks and overcomes data scarcity limitations by co-learning all these tasks simultaneously to offer improved performance. We propose a novel Multi-Task Multiple Kernel Learning framework based on Support Vector Machines for binary classification tasks. By considering pair-wise task affinity in terms of similarity between a pair's respective feature spaces, the new framework, compared to other similar MTL approaches, offers a high degree of flexibility in determining how similar feature spaces should be, as well as which pairs of tasks should share a common feature space in order to benefit overall performance. The associated optimization problem is solved via a block coordinate descent, which employs a consensus-form Alternating Direction Method of Multipliers algorithm to optimize the Multiple Kernel Learning weights and, hence, to determine task affinities. Empirical evaluation on seven data sets exhibits a statistically significant improvement of our framework's results compared to the ones of several other Clustered Multi-Task Learning methods.

Scientific Track
Authors:
Peng Luo, Northwest University, China
Jinye Peng, Northwest University, China
Ziyu Guan, Northwest University, China
Jianping Fan, Northwest University, China

Abstract:
Many real-world datasets are represented by multiple features or modalities which often provide compatible and complementary information to each other. In order to obtain a good data representation that synthesizes multiple features, researchers have proposed different multi-view subspace learning algorithms. Although label information has been exploited for guiding multi-view subspace learning, previous approaches either fail to directly capture the semantic relations between labeled items or unrealistically make Gaussian assumptions about the data distribution. In this paper, we propose a new multi-view nonnegative subspace learning algorithm called Multi-view Semantic Learning (MvSL). MvSL tries to capture the semantic structure of multi-view data by a novel graph embedding framework. The key idea is to let neighboring intra-class items be near each other while keeping nearest inter-class items away from each other in the learned common subspace across multiple views. This nonparametric scheme can better model non-Gaussian data. To assess nearest neighbors in the multi-view context, we develop a multiple kernel learning method for obtaining an optimal kernel combination from multiple features. In addition, we encourage each latent dimension to be associated with a subset of views via sparseness constraints. In this way, MvSL is able to capture flexible conceptual patterns hidden in multi-view features. Experiments on two real-world datasets demonstrate the effectiveness of the proposed algorithm.

Scientific Track
Authors:
Adolfo Martinez-Uso, Polytechnic University of Valencia
Jose Hernandez-Orallo, Polytechnic University of Valencia

Abstract:
Multidimensional data is systematically analysed at multiple granularities by applying aggregate and disaggregate operators (e.g., by the use of OLAP tools). For instance, in a supermarket we may want to predict sales of tomatoes for next week, but we may also be interested in predicting sales for all vegetables (higher up in the product hierarchy) for next Friday (lower down in the time dimension). While the domain and data are the same, the operating context is different. We explore several approaches for multidimensional data when predictions have to be made at different levels (or contexts) of aggregation. One method relies on the same resolution, another approach aggregates predictions bottom-up, a third approach disaggregates predictions top-down and a final technique corrects predictions using the relation between levels. We show how these strategies behave when the resolution context changes, using several machine learning techniques in four application domains.
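As a toy illustration of two of the four strategies (bottom-up aggregation and top-down disaggregation), consider weekly versus daily sales predictions; all numbers and variable names below are invented for the example.

```python
# Per-day forecasts from a model trained at daily resolution.
daily_pred = {"Mon": 120.0, "Tue": 95.0, "Wed": 110.0}

# Bottom-up: aggregate fine-grained predictions to the coarser level.
weekly_bottom_up = sum(daily_pred.values())          # 325.0

# Top-down: a model trained at weekly resolution predicts the total,
# which is disaggregated using historical per-day shares.
weekly_model_pred = 340.0
hist_share = {"Mon": 0.38, "Tue": 0.28, "Wed": 0.34}
daily_top_down = {d: weekly_model_pred * s for d, s in hist_share.items()}
```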

Scientific Track
Authors:

Weixiang Shao, Lifang He, Philip Yu

Affiliation(s):
Univ. of Illinois at Chicago

Abstract:
With the advance of technology, data often come with multiple modalities or from multiple sources. Multi-view clustering provides a natural way for generating clusters from such data. Although multi-view clustering has been successfully applied in many applications, most of the previous methods assumed the completeness of each view (i.e., each instance appears in all views). However, in real-world applications, it is often the case that a number of views are available for learning but none of them is complete. The incompleteness of all the views and the number of available views make it difficult to integrate all the incomplete views and get a better clustering solution. In this paper, we propose MIC (Multi-Incomplete-view Clustering), an algorithm based on weighted nonnegative matrix factorization with L2,1 regularization. The proposed MIC works by learning the latent feature matrices for all the views and generating a consensus matrix so that the difference between each view and the consensus is minimized. MIC has several advantages compared with other existing methods. First, MIC incorporates weighted nonnegative matrix factorization, which handles the missing instances in each incomplete view. Second, MIC uses a co-regularized approach, which pushes the learned latent feature matrices of all the views towards a common consensus. By regularizing the disagreement between the latent feature matrices and the consensus, MIC can be easily extended to more than two incomplete views. Third, MIC incorporates L2,1 regularization into the weighted nonnegative matrix factorization, which makes it robust to noise and outliers. Fourth, an iterative optimization framework is used in MIC, which is scalable and proven to converge. Experiments on real datasets demonstrate the advantages of MIC.
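The weighting idea at the heart of MIC can be sketched as a masked non-negative matrix factorization, where a 0/1 weight matrix removes missing instances from the objective. The sketch below covers a single view with standard multiplicative updates; the consensus coupling across views and the L2,1 term are omitted, so this is background rather than the full algorithm.

```python
import numpy as np

def weighted_nmf(X, M, k, iters=200, eps=1e-9, seed=0):
    """Minimize ||M * (X - U @ V)||_F^2 with U, V >= 0, where M[i, j] = 0
    marks a missing entry so it does not contribute to the loss."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U, V = rng.random((n, k)), rng.random((k, m))
    for _ in range(iters):
        # Multiplicative updates restricted to the observed entries.
        U *= ((M * X) @ V.T) / (((M * (U @ V)) @ V.T) + eps)
        V *= (U.T @ (M * X)) / ((U.T @ (M * (U @ V))) + eps)
    return U, V
```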

Scientific Track
Authors:
Hoang-Vu Nguyen, Max Planck Institute for Informatics
Jilles Vreeken, Max Planck Institute for Informatics

Abstract:
Quantifying the difference between two distributions is a common problem in many machine learning and data mining tasks. What is also common in many tasks is that we only have empirical data. That is, we do not know the true distributions nor their form, and hence, before we can measure their divergence we first need to assume a distribution or perform estimation. For exploratory purposes this is unsatisfactory, as we want to explore the data, not our expectations. In this paper we study how to non-parametrically measure the divergence between two distributions. More specifically, we formalise the well-known Jensen-Shannon divergence using cumulative distribution functions. This allows us to calculate divergences directly and efficiently from data without the need for estimation. Moreover, empirical evaluation shows that our method performs very well in detecting differences between distributions, outperforming the state of the art in both statistical power and efficiency for a wide range of tasks.
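One plausible reading of the construction, sketched below: replace densities in the Jensen-Shannon formula with empirical CDFs evaluated on the pooled sample, so the divergence comes directly from data. The exact estimator in the paper may differ; this only conveys the density-estimation-free idea.

```python
import numpy as np

def ecdf(sample, grid):
    """Empirical CDF of `sample` evaluated at the points in `grid`."""
    return np.searchsorted(np.sort(sample), grid, side="right") / len(sample)

def cdf_js_divergence(x, y):
    """JS-style divergence computed from empirical CDFs (illustrative)."""
    grid = np.sort(np.concatenate([x, y]))
    P, Q = ecdf(x, grid), ecdf(y, grid)
    M = np.maximum(0.5 * (P + Q), 1e-12)
    kl_pm = np.where(P > 0, P * np.log(np.maximum(P, 1e-12) / M), 0.0)
    kl_qm = np.where(Q > 0, Q * np.log(np.maximum(Q, 1e-12) / M), 0.0)
    dx = np.diff(grid, append=grid[-1])   # Riemann widths over the pooled grid
    return 0.5 * float(np.sum((kl_pm + kl_qm) * dx))
```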

Scientific Track
Authors:
Meelis Kull, University of Bristol
Peter Flach, University of Bristol

Abstract:
There are several reasons to evaluate a multi-class classifier on measures other than just error rate. Perhaps most importantly, there can be uncertainty about the exact context of classifier deployment, requiring the classifier to perform well with respect to a variety of contexts. This is commonly achieved by creating a scoring classifier which outputs posterior class probability estimates. Proper scoring rules are loss evaluation measures of scoring classifiers which are minimised at the true posterior probabilities. The well-known decomposition of the proper scoring rules into calibration loss and refinement loss has facilitated the development of methods to reduce these losses, thus leading to better classifiers. We propose multiple novel decompositions including one with four terms: adjustment loss, post-adjustment calibration loss, grouping loss and irreducible loss. The separation of adjustment loss from calibration loss requires extra assumptions which we prove to be satisfied for the most frequently used proper scoring rules: Brier score and log-loss. We propose algorithms to perform adjustment as a simpler alternative to calibration.
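For reference, the classical two-term decomposition that the paper extends can be estimated empirically by binning predicted probabilities; the sketch below is for the binary Brier score and is not the paper's new four-term decomposition.

```python
import numpy as np

def brier_decomposition(p, y, bins=10):
    """Estimate calibration and refinement loss for binary outcomes.

    brier == cal + ref holds exactly only when predictions are constant
    within each bin; otherwise the binned estimate is an approximation.
    """
    p, y = np.asarray(p, float), np.asarray(y, float)
    brier = np.mean((p - y) ** 2)
    idx = np.minimum((p * bins).astype(int), bins - 1)
    cal = ref = 0.0
    for b in range(bins):
        mask = idx == b
        if mask.any():
            obs = y[mask].mean()                               # bin frequency
            cal += mask.mean() * (p[mask].mean() - obs) ** 2   # calibration
            ref += mask.mean() * obs * (1.0 - obs)             # refinement
    return brier, cal, ref
```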

Scientific Track
Authors:
Alexander Ororbia, Pennsylvania State University
David Reitter, Pennsylvania State University
Jian Wu, Pennsylvania State University
C. Lee Giles, Pennsylvania State University

Abstract:
A hybrid architecture is presented capable of online learning from both labeled and unlabeled samples. It combines both generative and discriminative objectives to derive a new variant of the Deep Belief Network, i.e., the Stacked Boltzmann Experts Network model. The model's training algorithm is built on principles developed from hybrid discriminative Boltzmann machines and composes deep architectures in a greedy fashion. It makes use of its inherent "layer-wise ensemble" nature to perform useful classification work. We (1) compare this architecture against a hybrid Denoising Auto-encoder version of itself as well as several other models and (2) investigate training in the context of an incremental learning procedure. The best-performing hybrid model, the Stacked Boltzmann Experts Network, consistently outperforms all others.

Scientific Track
Authors:
Marina Vidovic, TU Berlin
Nico Goernitz, TU Berlin
Klaus-Robert Mueller, TU Berlin
Gunnar Raetsch, MSKCC
Marius Kloft, HU Berlin

Abstract:
This work is in the context of kernel-based learning algorithms for sequence data. We present a probabilistic approach to automatically extract, from the output of such string-kernel-based learning algorithms, the subsequences---or motifs---truly underlying the machine's predictions. The proposed framework views motifs as free parameters in a probabilistic model, which is solved through a global optimization approach. In contrast to prevalent approaches, the proposed method can discover even difficult, long motifs, and could be combined with any kernel-based learning algorithm that is based on an adequate sequence kernel. We show that, by using a discriminative kernel machine such as a support vector machine, the approach can reveal the discriminative motifs underlying the kernel predictor. We demonstrate the efficacy of our approach through a series of experiments on synthetic and real data, including problems from handwritten digit recognition and a large-scale human splice site data set from the domain of computational biology.

Scientific Track
Authors:
David Tolpin, University of Oxford
Jan-Willem van de Meent, University of Oxford
Brooks Paige, University of Oxford
Frank Wood, University of Oxford

Abstract:
We introduce an adaptive output-sensitive Metropolis-Hastings algorithm for probabilistic models expressed as programs, Adaptive Lightweight Metropolis-Hastings (AdLMH). The algorithm extends Lightweight Metropolis-Hastings (LMH) by adjusting the probabilities of proposing random variables for modification to improve convergence of the program output. We show that AdLMH converges to the correct equilibrium distribution and compare convergence of AdLMH to that of LMH on several test problems to highlight different aspects of the adaptation scheme. We observe consistent improvement in convergence on the test problems.

Scientific Track
Authors:
Sebastian Tschiatschek, ETH Zurich
Franz Pernkopf, Graz University of Technology

Abstract:
We consider online learning of Bayesian network classifiers (BNCs) with reduced-precision parameters, i.e. the conditional-probability tables parameterizing the BNCs are represented by low bit-width fixed-point numbers. In contrast to previous work, we analyze the learning of these parameters using reduced-precision arithmetic only, which is important for computationally constrained platforms, e.g. embedded and ambient systems, as well as power-aware systems. This requires specialized algorithms since naive implementations of the projection for ensuring the sum-to-one constraint of the parameters in gradient-based learning are not sufficiently accurate. In particular, we present generative and discriminative learning algorithms for BNCs relying only on reduced-precision arithmetic. For several standard benchmark datasets, these algorithms achieve classification-rate performance close to that of BNCs with parameters learned by conventional algorithms using double-precision floating-point arithmetic. Our results facilitate the utilization of BNCs in the aforementioned systems.
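To see why a projection step is needed at all, consider representing a conditional-probability row with 8-bit fixed-point numbers: naive per-entry rounding generally breaks the sum-to-one constraint. The largest-remainder repair below is a simple illustrative fix, not the authors' reduced-precision projection.

```python
import numpy as np

def to_fixed_point_probs(p, bits=8):
    """Quantize a probability vector (assumed to sum to one) to `bits`-bit
    fixed-point integers that still sum exactly to 2**bits."""
    scale = 1 << bits
    raw = p * scale
    q = np.floor(raw).astype(int)
    # Give the leftover mass to the entries with the largest remainders.
    short = scale - q.sum()
    order = np.argsort(raw - q)[::-1]
    q[order[:short]] += 1
    return q   # interpret q / 2**bits as the reduced-precision parameters

p = np.array([0.61, 0.27, 0.12])
q = to_fixed_point_probs(p, bits=8)
print(q, q.sum())   # e.g. [156  69  31] 256
```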

Scientific Track
Authors:

Davide Nitti, Vaishak Belle, Luc De Raedt

Affiliation(s):
KU Leuven
University of Toronto

Abstract:
Real-world planning problems frequently involve mixtures of continuous and discrete state variables and actions, and are formulated in environments with an unknown number of objects. In recent years, probabilistic programming has emerged as a natural approach to capture and characterize such complex probability distributions with general-purpose inference methods. While it is known that a probabilistic programming language can be easily extended to represent Markov Decision Processes (MDPs) for planning tasks, solving such tasks is challenging. Building on related efforts in reinforcement learning, we introduce a conceptually simple but powerful planning algorithm for MDPs realized as a probabilistic program. This planner constructs approximations to the optimal policy by importance sampling, while exploiting the knowledge of the MDP model. In our empirical evaluations, we show that this approach has wide applicability on domains ranging from strictly discrete to strictly continuous to hybrid ones, handles intricacies such as unknown objects, and is argued to be competitive given its generality.

Scientific Track
Authors:
Jinseok Nam, TU Darmstadt
Eneldo Loza Mencia, TU Darmstadt
Hyunwoo Kim, University of Wisconsin-Madison
Johannes Fürnkranz, TU Darmstadt

Abstract:
An important problem in multi-label classification is to capture label patterns or underlying structures that have an impact on such patterns. One way of learning underlying structures over labels is to project both instances and labels into the same space, where an instance and its relevant labels tend to have similar representations. In this paper, we present a novel method to learn a joint space of instances and labels by leveraging a hierarchy of labels. We also present an efficient method for pretraining vector representations of labels, namely label embeddings, from large amounts of label co-occurrence patterns and hierarchical structures of labels. This approach also allows us to make predictions on labels that have not been seen during training. We empirically show that the use of pretrained label embeddings allows us to obtain higher accuracies on unseen labels even when the number of labels is quite large. Our experimental results also demonstrate qualitatively that the proposed method is able to learn regularities among labels by exploiting a label hierarchy as well as label co-occurrences.

Scientific Track
Authors:
Yahel David, Technion
Nahum Shimkin, Technion

Abstract:
We consider a variant of the Multi-Armed Bandit problem which involves a large pool of a priori identical arms (or items). Each arm is associated with a deterministic value, which is sampled from a probability distribution with unknown maximal value, and is revealed once that arm is chosen. At each time instant the agent may choose a new arm (with unknown value), or a previously-chosen arm whose value is already revealed. The goal is to minimize the cumulative regret relative to the best arm in the pool. Previous work has established a lower bound on the regret for this model, depending on the functional form of the tail of the sample distribution, as well as algorithms that attain this bound up to logarithmic terms. Here, we present a more refined algorithm that attains the same order as the lower bound. We further consider several variants of the basic model, involving an anytime algorithm and the case of non-retainable arms. Numerical experiments demonstrate the superior performance of the suggested algorithms.

Scientific Track
Authors:
Wendelin Böhmer, TU Berlin
Klaus Obermayer, TU Berlin

Abstract:
Many applications that use empirically estimated functions face a curse of dimensionality, because integrals over most function classes must be approximated by sampling. This paper introduces a novel regression algorithm that learns linear factored functions (LFF). This class of functions has structural properties that allow certain integrals to be solved analytically and point-wise products to be calculated. Applications like belief propagation and reinforcement learning can exploit these properties to break the curse and speed up computation. We derive a regularized greedy optimization scheme that learns factored basis functions during training. The novel regression algorithm performs competitively with Gaussian processes on benchmark tasks, and the learned LFF are very compact, requiring only 4-9 factored basis functions on average.

Scientific Track
Authors:
Gonzalo Bello, North Carolina State University
Michael Angus, North Carolina State University
Navya Pedemane, North Carolina State University
Jitendra Harlalka, North Carolina State University
Fredrick Semazzi, North Carolina State University
Vipin Kumar, University of Minnesota
Nagiza Samatova, North Carolina State University

Abstract:
Discovering climate indices--time series that summarize spatiotemporal climate patterns--is a key task in the climate science domain. In this work, we approach this task as a problem of response-guided community detection; that is, identifying communities in a graph associated with a response variable of interest. To this end, we propose a general strategy for response-guided community detection that explicitly incorporates information of the response variable during the community detection process, and introduce a graph representation of spatiotemporal data that leverages information from multiple variables. We apply our proposed methodology to the discovery of climate indices associated with seasonal rainfall variability. Our results suggest that our methodology is able to capture the underlying patterns known to be associated with the response variable of interest and to improve its predictability compared to existing methodologies for data-driven climate index discovery and official forecasts.

Scientific Track
Authors:

Yutaro Shigeto, Ikumi Suzuki, Kazuo Hara, Masashi Shimbo, Yuji Matsumoto

Affiliation(s):
NAIST
The Institute of Statistical Mathematics
National Institute of Genetics

Abstract:
This paper discusses the effect of hubness in zero-shot learning, when ridge regression is used to find a mapping between the example space and the label space. Contrary to the existing approach, which attempts to find a mapping from the example space to the label space, we show that mapping labels into the example space is desirable to suppress the emergence of hubs in the subsequent nearest neighbor search step. Assuming a simple data model, we prove that the proposed approach indeed reduces hubness. This was verified empirically on the tasks of bilingual lexicon extraction and image labeling: hubness was reduced with both of these tasks and the accuracy was improved accordingly.
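The direction reversal is easy to state in code: fit ridge regression from the label space into the example space and run the nearest-neighbor search there. The data below is synthetic and the setup is a bare-bones sketch of the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))    # training examples
Y = rng.standard_normal((100, 10))    # embeddings of their labels
lam = 1.0

# Ridge solution mapping label space -> example space:
# W = (Y'Y + lam I)^{-1} Y'X
W = np.linalg.solve(Y.T @ Y + lam * np.eye(Y.shape[1]), Y.T @ X)

candidates = rng.standard_normal((20, 10))   # candidate label embeddings
projected = candidates @ W                   # candidate labels in example space
query = rng.standard_normal(50)
best = np.argmin(np.linalg.norm(projected - query, axis=1))
print("predicted label index:", best)        # nearest projected label
```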

Scientific Track
Authors:
Shi Zhi, UIUC
Jiawei Han, UIUC
Quanquan Gu, University of Virginia

Abstract:
Graph regularization-based methods have achieved great success for network classification by making the label-link consistency assumption, i.e., if two nodes are linked together, they are likely to belong to the same class. However, in a real-world network, there exist links that connect nodes of different classes. These inconsistent links raise a big challenge for graph regularization and deteriorate the classification performance significantly. To address this problem, we propose a novel algorithm, namely Consistent Graph Learning, which is robust to the inconsistent links of a network. In particular, given a network and a small number of labeled nodes, we aim at learning a consistent network with more consistent and fewer inconsistent links than the original network. Since the link information of a network is naturally represented by a set of relation matrices, the learning of a consistent network is reduced to learning consistent relation matrices under some constraints. More specifically, we achieve it by joint graph regularization on the nuclear norm minimization of consistent relation matrices together with l1-norm minimization on the difference matrices between the original relation matrices and the learned consistent ones subject to certain constraints. Experiments on both homogeneous and heterogeneous network datasets show that the proposed method outperforms the state-of-the-art methods.

Scientific Track
Authors:

Changwei Hu, Piyush Rai, Changyou Chen, Matthew Harding, Lawrence Carin

Affiliation(s):
Duke University

Abstract:
We present a Bayesian non-negative tensor factorization model for count-valued tensor data, and develop scalable inference algorithms (both batch and online) for dealing with massive tensors. Our generative model can handle overdispersed counts as well as infer the rank of the decomposition. Moreover, leveraging a reparameterization of the Poisson distribution as a multinomial facilitates conjugacy in the model and enables simple Gibbs sampling and variational Bayes (VB) inference updates. We also develop a set of online inference algorithms that allow scaling up the model to massive tensors. We apply our framework on diverse real-world applications, such as analyzing a political science database, multiway topic modeling on a scientific publications database, and analyzing household transactions data.

Scientific Track
Authors:
Farzaneh Mirzazadeh, University of Alberta
Martha White, Indiana University
András György, University of Alberta
Dale Schuurmans, University of Alberta

Abstract:
We present a general formulation of metric learning for co-embedding, where the goal is to relate objects from different sets. The framework allows metric learning to be applied to a wide range of problems---including link prediction, relation learning, multi-label tagging and ranking---while allowing training to be reformulated as convex optimization. For training we provide a fast iterative algorithm that improves the scalability of existing metric learning approaches. Empirically, we demonstrate that the proposed method converges to a global optimum efficiently, and achieves competitive results in a variety of co-embedding problems such as multi-label classification and multi-relational prediction.

Scientific Track
Authors:
Jiwoong Im, University of Guelph
Graham Taylor, University of Guelph

Abstract:
Auto-encoders are perhaps the best-known non-probabilistic methods for representation learning. They are conceptually simple and easy to train. Recent theoretical work has shed light on their ability to capture manifold structure, and drawn connections to density modelling. This has motivated researchers to seek ways of auto-encoder scoring, which has furthered their use in classification. Gated auto-encoders (GAEs) are an interesting and flexible extension of auto-encoders which can learn transformations among different images or pixel covariances within images. However, they have been much less studied, theoretically or empirically. In this work, we apply a dynamical systems view to GAEs, deriving a scoring function, and drawing connections to Restricted Boltzmann Machines. On a set of deep learning benchmarks, we also demonstrate their effectiveness for single and multi-label classification.

Scientific Track
Authors:
Min Xiao, Temple University
Yuhong Guo, Temple University

Abstract:
Heterogeneous domain adaptation aims to exploit labeled training data from a source domain for learning prediction models in a target domain under the condition that the two domains have different input feature representation spaces. In this paper, we propose a novel semi-supervised subspace co-projection method to address multi-class heterogeneous domain adaptation. The proposed method projects the instances of the two domains into a co-located latent subspace to bridge the feature divergence gap across domains, while simultaneously training prediction models in the co-projected representation space with labeled training instances from both domains. It also exploits the unlabeled data to promote the consistency of co-projected subspaces from the two domains based on a maximum mean discrepancy criterion. Moreover, to increase the stability and discriminative informativeness of the subspace co-projection, we further exploit the error-correcting output code schemes to incorporate more binary prediction tasks shared across domains into the learning process. We formulate this semi-supervised learning process as a non-convex joint minimization problem and develop an alternating optimization algorithm to solve it. To investigate the empirical performance of the proposed approach, we conduct experiments on cross-lingual text classification and cross-domain digit image classification tasks with heterogeneous feature spaces. The experimental results demonstrate the efficacy of the proposed method on these heterogeneous domain adaptation problems.

Scientific Track
Authors:
Senjian An, The University of Western Australia
Qiuhong Ke, The University of Western Australia
Mohammed Bennamoun, The University of Western Australia
Farid Boussaid, The University of Western Australia
Ferdous Sohel, The University of Western Australia

Abstract:
In this paper we introduce sign constrained rectifier networks (SCRN), demonstrate their universal classification power and illustrate their applications to pattern decompositions. We prove that the proposed two-hidden-layer SCRN, with sign constraints on the weights of the output layer and on those of the top hidden layer, are capable of separating any two disjoint pattern sets. Furthermore, a two-hidden-layer SCRN of a pair of disjoint pattern sets can be used to decompose one of the pattern sets into several subsets so that each subset is convexly separable from the entire other pattern set; and a single-hidden-layer SCRN of a pair of convexly separable pattern sets can be used to decompose one of the pattern sets into several subsets so that each subset is linearly separable from the entire other pattern set. SCRN can thus be used to learn the pattern structures from the decomposed subsets of patterns and to analyse the discriminant factors of different patterns from the linear classifiers of the linearly separable subsets in the decompositions. With such pattern decompositions exhibiting convex separability or linear separability, users can also analyse the complexity of the classification problem, remove the outliers and the non-crucial points to improve the training of the traditional unconstrained rectifier networks in terms of both performance and efficiency.

Scientific Track
Authors:
Antonio Vergari, University of Bari Aldo Moro
Nicola Di Mauro, University of Bari Aldo Moro
Floriana Esposito, University of Bari Aldo Moro

Abstract:
The need for feasible inference in Probabilistic Graphical Models (PGMs) has led to tractable models like Sum-Product Networks (SPNs). Their highly expressive power and their ability to provide exact and tractable inference make them very attractive for several real-world applications, from computer vision to NLP. Recently, great attention around SPNs has focused on structure learning, leading to different algorithms being able to learn both the network and its parameters from data. Here, we enhance one of the best structure learners, LearnSPN, aiming to improve both the structural quality of the learned networks and their achieved likelihoods. Our algorithmic variations are able to learn simpler, deeper and more robust networks. These results have been obtained by exploiting some insights in the building process done by LearnSPN, by hybridizing the network adopting tree-structured models as leaves, and by blending bagging estimations into mixture creation. We prove our claims by empirically evaluating the learned SPNs on several benchmark datasets against other competitive SPN and PGM structure learners.

Scientific Track
Authors:
Steven Prestwich, University College Cork
Adejuyigbe Fajemisin, University College Cork
Laura Climent, University College Cork
Barry O'Sullivan, University College Cork

Abstract:
We are working with a company on a hard industrial optimisation problem: a version of the well-known Cutting Stock Problem in which a paper mill must cut rolls of paper following certain cutting patterns to meet customer demands. In our problem each roll to be cut may have a different size, there are multiple mills, and the cutting patterns are continuous and semi-automated, so that we have only indirect control over them via a list of parameters. We solve it using a combination of machine learning and optimisation techniques. First we learn the distribution of cutting patterns via Monte Carlo simulation. Secondly we cover the distribution by applying a k-medoids algorithm. Thirdly we use the results to build a MILP model which is then solved.
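Step two of the pipeline, covering the simulated cutting-pattern distribution with k representative patterns, can be sketched with a plain k-medoids over a precomputed distance matrix; this generic implementation is illustrative, not the authors' code.

```python
import numpy as np

def k_medoids(D, k, iters=50, seed=0):
    """Plain k-medoids on a precomputed pairwise-distance matrix D.

    Alternates between assigning each point to its nearest medoid and
    moving each medoid to the member minimizing the within-cluster
    distance sum.
    """
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    labels = np.argmin(D[:, medoids], axis=1)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)
        new = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=1)
                new[c] = members[np.argmin(within)]
        if np.array_equal(new, medoids):
            break            # converged: medoids no longer move
        medoids = new
    return medoids, labels
```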

Scientific Track
Authors:
Michael Großhans, University of Potsdam
Tobias Scheffer, University of Potsdam

Abstract:
Learning problems in which an adversary can perturb instances at application time can be modeled as games with data-dependent cost functions. In an equilibrium point, the learner's model parameters are the optimal reaction to the data generator's perturbation, and vice versa. Finding an equilibrium point requires the solution of a difficult optimization problem in which both the learner's model parameters and the possible perturbations are free parameters. We study a perturbation model and derive optimization procedures that use a single iteration of batch-parallel gradient descent and a subsequent aggregation step, thus allowing for parallelization with minimal synchronization overhead.
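Structurally, the optimization amounts to one gradient step per data shard followed by a single aggregation; a minimal sketch with an averaged update and a squared-loss gradient (all names and the loss are illustrative) is:

```python
import numpy as np

def parallel_step(w, X_shards, y_shards, grad_fn, lr=0.1):
    """One batch-parallel gradient step with a single aggregation:
    each shard computes an independent update, then updates are averaged,
    which keeps synchronization overhead minimal."""
    updates = [w - lr * grad_fn(w, X, y) for X, y in zip(X_shards, y_shards)]
    return np.mean(updates, axis=0)

# Example with a squared-loss gradient on synthetic shards.
rng = np.random.default_rng(0)
shards = [(rng.standard_normal((50, 3)), rng.standard_normal(50)) for _ in range(4)]
grad = lambda w, X, y: 2 * X.T @ (X @ w - y) / len(y)
w = parallel_step(np.zeros(3), [X for X, _ in shards], [y for _, y in shards], grad)
```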

Scientific Track
Authors:
Sotirios Chatzis, Cyprus University of Technology

Abstract:
Recurrent neural networks (RNNs) have recently gained renewed attention from the machine learning community as effective methods for modeling variable-length sequences. Language modeling, handwriting recognition, and speech recognition are only a few of the application domains where RNN-based models have achieved the state-of-the-art performance currently reported in the literature. Typically, RNN architectures utilize simple linear, logistic, or softmax output layers to perform data modeling and prediction generation. In this work, for the first time in the literature, we consider using a sparse Bayesian regression or classification model as the output layer of RNNs, inspired from the automatic relevance determination (ARD) technique. The notion of ARD is to continually create new components while detecting when a component starts to overfit, where overfit manifests itself as a precision hyperparameter posterior tending to infinity. This way, our method manages to train sparse RNN models, where the number of effective ("active") recurrently connected hidden units is selected in a data-driven fashion, as part of the model inference procedure. We develop efficient and scalable training algorithms for our model under the stochastic variational inference paradigm, and derive elegant predictive density expressions with computational costs comparable to conventional RNN formulations. We evaluate our approach considering its application to benchmark tasks dealing with both regression and classification problems, and exhibit its favorable performance over the state-of-the-art.

Scientific Track
Authors:
Ehsan Shareghi, Monash University
Gholamreza Haffari, Monash University
Trevor Cohn, University of Melbourne
Ann Nicholson, Monash University

Abstract:
Linguistic structures exhibit a rich array of global phenomena; however, commonly used Markov models are unable to adequately describe these phenomena due to their strong locality assumptions. We propose a novel hierarchical model for structured prediction over sequences and trees which exploits global context by conditioning each generation decision on an unbounded context of prior decisions. This builds on the success of Markov models but without imposing a fixed bound in order to better represent global phenomena. To facilitate learning of this large and unbounded model, we use a hierarchical Pitman-Yor process prior which provides a recursive form of smoothing. We propose prediction algorithms based on A* and Markov Chain Monte Carlo sampling. Empirical results demonstrate the potential of our model compared to baseline finite-context Markov models on three tasks: morphological parsing, syntactic parsing and part-of-speech tagging.

Scientific Track
Authors:
Martin Ratajczak, TU Graz
Sebastian Tschiatschek, ETH Zurich
Franz Pernkopf, Graz University of Technology

Abstract:
We introduce both joint training of neural higher-order linear-chain conditional random fields (NHO-LC-CRFs) and a new structured regularizer for sequence modelling. We show that this regularizer can be derived as a lower bound from a mixture of models sharing parts, e.g. neural sub-networks, and relate it to ensemble learning. Furthermore, it can be expressed explicitly as a regularization term in the training objective. We exemplify its effectiveness by exploring the introduced NHO-LC-CRFs for sequence labeling. Higher-order LC-CRFs with linear factors are well-established for that task, but they lack the ability to model non-linear dependencies. These non-linear dependencies, however, can be efficiently modeled by neural higher-order input-dependent factors. Experimental results for phoneme classification with NHO-LC-CRFs confirm this fact, and we achieve a state-of-the-art phoneme error rate of 16.7% on TIMIT using the new structured regularizer.

Scientific Track
Authors:
Eyke Huellermeier, University of Paderborn
Weiwei Cheng, Amazon

Abstract:
In standard supervised learning, each training instance is associated with an outcome from a corresponding output space (e.g., a class label in classification or a real number in regression). In the superset learning problem, the outcome is only characterized in terms of a superset---a subset of candidates that covers the true outcome but may also contain additional ones. Thus, superset learning can be seen as a specific type of weakly supervised learning, in which training examples are ambiguous. In this paper, we introduce a generic approach to superset learning, which is motivated by the idea of performing model identification and "data disambiguation" simultaneously. This idea is realized by means of a generalized risk minimization approach, using an extended loss function that compares precise predictions with set-valued observations. As an illustration, we instantiate our meta learning technique for the problem of label ranking, in which the output space consists of all permutations of a fixed set of items. The label ranking method thus obtained is compared to existing approaches tackling the same problem.
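The extended loss underlying this risk minimization can be sketched as an "optimistic" comparison of a precise prediction against the candidate set; the hypothetical helper below shows the idea for an arbitrary base loss.

```python
def superset_loss(prediction, candidate_set, base_loss):
    """Compare a precise prediction with a set-valued observation by
    scoring it against the most favorable candidate in the superset."""
    return min(base_loss(prediction, y) for y in candidate_set)

# With 0-1 loss, a prediction is correct if it hits any candidate.
zero_one = lambda y_hat, y: 0.0 if y_hat == y else 1.0
print(superset_loss("cat", {"cat", "dog"}, zero_one))  # 0.0
print(superset_loss("fox", {"cat", "dog"}, zero_one))  # 1.0
```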

Scientific Track
Authors:
Nicolas Méger, Université Savoie Mont Blanc
Christophe Rigotti, Université de Lyon
Catherine Pothier, Université de Lyon

Abstract:
Swap randomization has been shown to be an effective technique for assessing the significance of data mining results such as Boolean matrices, frequent itemsets, correlations, or clusterings. Basically, instead of applying statistical tests on selected attributes, the global structure of the actual dataset is taken into account by checking whether obtained results are likely or not to occur in randomized datasets whose column and row margins are equal to the ones of the actual dataset. In this paper, a swap randomization approach for bases of sequences is proposed with the aim of assessing sequential patterns extracted from Satellite Image Time Series (SITS). This assessment relies on the spatiotemporal locations of the extracted patterns. Using an entropy-based measure, the locations obtained on the actual dataset and a swap randomized dataset are compared. The potential and generality of the proposed approach are evidenced by experiments on both optical and radar SITS.
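For a binary matrix, the classic margin-preserving swap works as sketched below; the paper adapts this idea to bases of sequences, so treat this as background rather than the proposed method.

```python
import numpy as np

def swap_randomize(B, n_swaps, seed=0):
    """Randomize a 0/1 matrix while keeping all row and column sums fixed.

    Repeatedly pick rows r1, r2 and columns c1, c2 forming a 'checkerboard'
    (B[r1,c1] = B[r2,c2] = 1 and B[r1,c2] = B[r2,c1] = 0) and swap it; each
    such swap leaves every margin unchanged.
    """
    B = B.copy()
    rng = np.random.default_rng(seed)
    n, m = B.shape
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        r1, r2 = rng.integers(n, size=2)
        c1, c2 = rng.integers(m, size=2)
        if B[r1, c1] == 1 and B[r2, c2] == 1 and B[r1, c2] == 0 and B[r2, c1] == 0:
            B[r1, c1] = B[r2, c2] = 0
            B[r1, c2] = B[r2, c1] = 1
            done += 1
    return B
```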

Scientific Track
Authors:
Evgeniy Bart, Palo Alto Research Center
Bob Price, Palo Alto Research Center
John Hanley, Palo Alto Research Center

Abstract:
The Temporally Coherent Role-Topic Model (TCRTM) is a probabilistic graphical model for analyzing overlapping, loosely temporally structured activities in heterogeneous populations. Such structure appears in many domains where activities have temporal coherence, but no strong ordering. For instance, editing a PowerPoint presentation may involve opening files, typing text, and downloading images. These events occur together in time, but without fixed ordering or duration. Further, several different activities may overlap -- the user might check email while editing the presentation. Finally, the user population has subgroups; for example, managers, salespeople and engineers have different activity distributions. TCRTM automatically infers an appropriate set of roles and activity types, and segments users' event streams into high-level activity instance descriptions. On two real-world datasets involving computer user monitoring and debit card transactions we show that TCRTM extracts semantically meaningful structure and improves hold-out perplexity score by a factor of five compared to standard models.

Scientific Track
Authors:
Eric Malmi, Aalto University
Arno Solin, Aalto University
Aristides Gionis, Aalto University

Abstract:
We propose a probabilistic method for inferring the geographical locations of linked objects, such as users in a social network. Unlike existing methods, our model does not assume that the exact locations of any subset of the linked objects, like neighbors in a social network, are known. The method efficiently leverages prior knowledge on the locations, resulting in high geolocation accuracies even if none of the locations are initially known. Experiments are conducted for three scenarios: geolocating users of a location-based social network, geotagging historical church records, and geotagging Flickr photos. In each experiment, the proposed method outperforms two state-of-the-art network-based methods. Furthermore, the last experiment shows that the method can be employed not only to network-based but also to content-based location estimation.

Scientific Track
Authors:
Kailash Budhathoki, Cluster of Excellence MMCI
Jilles Vreeken, Max Planck Institute for Informatics

Abstract:
Suppose we are given a set of databases, such as sales records over different branches. How can we characterise the differences and the norm between these datasets? That is, what are the patterns that characterise the general distribution, and what are those that are important to describe the individual datasets? We study how to discover these pattern sets simultaneously and without redundancy -- automatically identifying those patterns that help describe the overall distribution, as well as those that are characteristic for specific databases. We define the problem in terms of the Minimum Description Length principle, and propose the DiffNorm algorithm to approximate the MDL-optimal summary directly from data. Empirical evaluation on synthetic and real-world data shows that DiffNorm efficiently discovers descriptions that accurately characterise the difference and the norm in easily understandable terms.

Scientific Track
Authors:
Haixia Liu, University of Nottingham
James Goulding, University of Nottingham
Timothy Brailsford, University of Nottingham

Abstract:
In this work we present a method for the computation of novel ideas from corpora of scientific text. The system functions by first detecting concept noun-phrases within the titles and abstracts of publications using Part-Of-Speech tagging, before classifying these into sets of problem and solution phrases via a target-word matching approach. By defining an idea as a co-occurring <problem, solution> pair, known-idea triples can be constructed through the additional assignment of a relevance value (computed via either idea occurrence or an idea frequency-inverse document frequency score). The resulting triples are then fed into a collaborative filtering algorithm, where problem phrases are considered as users and solution phrases as the items to be recommended. The final output is a ranked list of novel idea candidates, which researchers can integrate into hypothesis generation processes. Our approach is evaluated using a subset of publications from the journal Science, with precision and recall results indicating that the system is capable of generating useful novel ideas in an automated fashion.
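A hypothetical construction of the triples with a relevance score in the "idea frequency-inverse document frequency" spirit is sketched below; phrase extraction is assumed to have happened upstream, and the scoring is a generic tf-idf, not necessarily the paper's exact formula.

```python
import math
from collections import Counter

# Toy corpus: each document is the list of <problem, solution> ideas found in it.
doc_ideas = [
    [("image labeling", "neural network")],
    [("image labeling", "neural network"), ("text mining", "topic model")],
    [("text mining", "graph kernel")],
]
n_docs = len(doc_ideas)
freq = Counter(idea for doc in doc_ideas for idea in doc)       # idea frequency
doc_freq = Counter(idea for doc in doc_ideas for idea in set(doc))
triples = sorted(
    ((p, s, freq[(p, s)] * math.log(n_docs / doc_freq[(p, s)]))
     for (p, s) in freq),
    key=lambda t: -t[2],
)
# Each (problem, solution, relevance) triple then feeds collaborative
# filtering with problems as "users" and solutions as "items".
```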

Scientific Track
Authors:

Sen Wang, Feiping Nie, Xiaojun Chang, Lina Yao, Xue Li, Quan Sheng

Affiliation(s):
University of Queensland
University of Adelaide

Abstract:
Unsupervised feature selection has long attracted research attention in the machine learning and data mining communities. In this paper, we propose an unsupervised feature selection method seeking a feature coefficient matrix to select the most distinctive features. Specifically, our proposed algorithm integrates the Maximum Margin Criterion with a sparsity-based model into a joint framework, where the class margin and feature correlation are taken into account simultaneously. To maximize the total data separability while preserving minimized within-class scatter simultaneously, we propose to embed K-means into the framework to generate pseudo-class label information in a scenario of unsupervised feature selection. Meanwhile, a sparsity-based model, the ℓ2,p-norm, is imposed as the regularization term to effectively discover the sparse structures of the feature coefficient matrix. In this way, noisy and irrelevant features are removed by ruling out those features whose corresponding coefficients are zeros. To alleviate the local optimum problem caused by random initializations of K-means, a convergence-guaranteed algorithm with an updating strategy for the clustering indicator matrix is proposed to iteratively chase the optimal solution. Performance evaluation is extensively conducted over six benchmark data sets. Extensive experimental results demonstrate that our method has superior performance against all other compared approaches.

Scientific Track
Authors:
Reem Al-Otaibi, University of Bristol
Ricardo Prudencio, Federal University of Pernambuco
Meelis Kull, University of Bristol
Peter Flach, University of Bristol

Abstract:
Discriminative models for classification assume that training and deployment data are drawn from the same distribution. The performance of these models can vary significantly when they are learned and deployed in different contexts with different data distributions. In the literature, this phenomenon is called dataset shift. In this paper, we address several important issues in the dataset shift problem. First, how can we automatically detect that there is a significant difference between training and deployment data, so as to take action or adjust the model appropriately? Secondly, different shifts can occur in real applications (e.g., linear and non-linear), which require the use of diverse solutions. Thirdly, how should we combine the original model of the training data with other models to achieve better performance? This work offers two main contributions towards these issues. We propose a Versatile Model that is rich enough to handle different kinds of shift without making strong assumptions such as linearity, and furthermore does not require labelled data to identify the data shift at deployment. Empirical results on both synthetic and real dataset shifts show strong performance gains achieved by the proposed model.

Scientific Track
Authors:
Sascha Henzgen, University of Paderborn
Eyke Huellermeier, University of Paderborn

Abstract:
Measures of rank correlation are commonly used in statistics to capture the degree of concordance between two orderings of the same set of items. Standard measures like the well-known Kendall tau and Spearman rho coefficients put equal emphasis on each position of a ranking. Yet, motivated by applications in which some of the positions (typically those on the top) are more important than others, a few weighted variants of these measures have been proposed. Most of these generalizations fail to meet desirable formal properties, however. Besides, they are often quite inflexible in the sense of committing to a fixed weighting scheme. In this paper, we propose a weighted rank correlation measure on the basis of fuzzy order relations. Our measure is parametrized by a fuzzy equivalence relation on the rank positions, which in turn is specified conveniently by a so-called scaling function. This approach combines soundness with flexibility: it has a sound formal foundation and allows for weighting rank positions in a flexible way. The usefulness of our class of weighted rank correlation measures is illustrated by means of an experimental study.
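For contrast with the proposal, a generic position-weighted Kendall tau looks as follows; weighting each pair by a function of its top rank position is one common scheme, not the fuzzy-order-based measure introduced in the paper.

```python
import numpy as np

def weighted_kendall_tau(r1, r2, w):
    """Position-weighted Kendall tau between two rankings given as arrays
    of ranks (0 = top). Each item pair is weighted by w applied to the
    better of the two positions in the first ranking, emphasizing the top."""
    n = len(r1)
    num = den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            pair_w = w(min(r1[i], r1[j]))
            sign = np.sign((r1[i] - r1[j]) * (r2[i] - r2[j]))
            num += pair_w * sign
            den += pair_w
    return num / den

ranks_a = np.array([0, 1, 2, 3])
ranks_b = np.array([1, 0, 2, 3])   # top two items swapped
print(weighted_kendall_tau(ranks_a, ranks_b, lambda p: 1.0 / (p + 1)))
```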

Scientific Track
Authors:
Andrea Dal Pozzolo, Université Libre de Bruxelles
Olivier Caelen, Worldline
Gianluca Bontempi, Université Libre de Bruxelles

Abstract:
A well-known rule of thumb in unbalanced classification recommends rebalancing (typically by resampling) the classes before proceeding with the learning of the classifier. Though this seems to work for the majority of cases, no detailed analysis exists about the impact of undersampling on the accuracy of the final classifier. This paper aims to fill this gap by proposing an integrated analysis of the two elements which have the largest impact on the effectiveness of an undersampling strategy: the increase of the variance due to the reduction of the number of samples, and the warping of the posterior distribution due to the change of prior probabilities. In particular, we propose a theoretical analysis specifying under which conditions undersampling is recommended and expected to be effective. It emerges that the impact of undersampling depends on the number of samples, the variance of the classifier, the degree of unbalancedness and, more specifically, on the value of the posterior probability. This makes it difficult to predict the average effectiveness of an undersampling strategy, since its benefits depend on the distribution of the testing points. Results from several synthetic and real-world unbalanced datasets support and validate our findings.
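The posterior warping admits a closed form under class-dependent sampling: if each majority-class (negative) example is kept with probability beta, Bayes' rule turns a true posterior p into p_s = p / (p + beta(1 - p)). The sketch below implements this standard correction and its inverse; variable names are illustrative.

```python
def warped_posterior(p, beta):
    """Posterior observed after keeping each negative with probability beta:
    the minority class appears more likely on the undersampled data."""
    return p / (p + beta * (1.0 - p))

def unwarp(p_s, beta):
    """Invert the warping to recover calibrated probabilities."""
    return beta * p_s / (beta * p_s - p_s + 1.0)

print(warped_posterior(0.1, 0.2))                 # ~0.357
print(unwarp(warped_posterior(0.1, 0.2), 0.2))    # back to 0.1
```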