Accepted Papers

Machine Learning
Authors:

Giorgio Corani, Alessio Benavoli

Affiliation(s):
Istituto Dalle Molle di Studi sull’Intelligenza Artificiale (IDSIA)
Scuola universitaria professionale della Svizzera italiana (SUPSI)
Università della Svizzera italiana (USI)

Abstract:
We present a Bayesian approach for making statistical inference about the accuracy (or any other score) of two competing algorithms which have been assessed via cross-validation on multiple data sets. The approach consists of two pieces. The first is a novel correlated Bayesian t-test for the analysis of the cross-validation results on a single data set, which accounts for the correlation due to the overlapping training sets. The second piece merges the posterior probabilities computed by the Bayesian correlated t-test on the different data sets to make inference on multiple data sets, adopting a Poisson-binomial model. The inference on multiple data sets accounts for the different uncertainty of the cross-validation results on each data set; ours is the first test able to achieve this goal. It is generally more powerful than the signed-rank test when ten runs of cross-validation are performed, as is generally recommended anyway.
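
For a rough illustration of the single-data-set piece, the following minimal Python sketch computes the posterior probability that one algorithm outperforms the other from per-fold score differences, assuming the common Nadeau-Bengio heuristic rho = n_test / (n_train + n_test) for the fold correlation; the function and the heuristic are illustrative, not the paper's exact derivation.

    # Minimal sketch of a correlated Bayesian t-test on cross-validation
    # differences; rho is the assumed correlation between folds caused by
    # overlapping training sets (e.g. 0.1 for 10-fold CV).
    import numpy as np
    from scipy import stats

    def correlated_bayesian_ttest(diffs, rho):
        """Posterior probability that the mean score difference is > 0."""
        n = len(diffs)
        mean = diffs.mean()
        # The rho/(1-rho) term inflates the scale to account for correlation.
        scale = np.sqrt((1.0 / n + rho / (1.0 - rho)) * diffs.var(ddof=1))
        # Posterior over the mean difference is Student-t with n-1 dof.
        return 1.0 - stats.t.cdf(0.0, df=n - 1, loc=mean, scale=scale)

    # Example: 10 runs of 10-fold CV with a 90/10 split -> rho = 0.1.
    diffs = np.random.default_rng(0).normal(0.02, 0.05, size=100)
    print(correlated_bayesian_ttest(diffs, rho=0.1))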

Machine Learning
Authors:

Heiko Paulheim, Robert Meusel

Affiliation(s):
University of Mannheim, Germany

Abstract:
Outlier detection methods automatically identify instances that deviate from the majority of the data. In this paper, we propose a novel approach for unsupervised outlier detection, which re-formulates the outlier detection problem in numerical data as a set of supervised regression learning problems. For each attribute, we learn a predictive model which predicts the values of that attribute from the values of all other attributes, and compute the deviations between the predictions and the actual values. From those deviations, we derive both a weight for each attribute and a final outlier score using those weights. The weights help separate the relevant attributes from the irrelevant ones, and thus make the approach well suited for discovering outliers otherwise masked in high-dimensional data. An empirical evaluation shows that our approach outperforms existing algorithms and is particularly robust in datasets with many irrelevant attributes. Furthermore, we show that if a symbolic machine learning method is used to solve the individual learning problems, the approach is also capable of generating concise explanations for the detected outliers.
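
The attribute-wise scheme lends itself to a compact sketch. The Python snippet below is a loose illustration of the idea; the weighting formula is an invented proxy, not the paper's.

    # Score instances by how badly each attribute is predicted from the rest;
    # weight attributes by their overall predictability (illustrative proxy).
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_predict

    def regression_outlier_scores(X):
        n, d = X.shape
        deviations = np.zeros_like(X, dtype=float)
        weights = np.zeros(d)
        for j in range(d):
            others = np.delete(X, j, axis=1)
            # Out-of-sample predictions avoid overfitting the outliers.
            pred = cross_val_predict(LinearRegression(), others, X[:, j], cv=5)
            deviations[:, j] = np.abs(pred - X[:, j])
            weights[j] = max(0.0, 1.0 - deviations[:, j].mean()
                             / (X[:, j].std() + 1e-12))
        return deviations @ weights  # one outlier score per instance

    X = np.random.default_rng(1).normal(size=(200, 5))
    X[0] += 6.0  # plant an outlier
    print(regression_outlier_scores(X).argmax())  # expected: 0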

Scientific Track
Authors:

Ehsan Amid, Aristides Gionis, Antti Ukkonen

Affiliation(s):
Aalto University
Finnish Institute of Occupational Health

Abstract:
We consider the problem of clustering a given dataset into k clusters subject to an additional set of constraints on relative distance comparisons between the data items. The additional constraints are meant to reflect side-information that is not expressed directly in the feature vectors. Relative comparisons can express structures at a finer level of detail than the must-link (ML) and cannot-link (CL) constraints that are commonly used for semi-supervised clustering. Relative comparisons are particularly useful in settings where giving an ML or a CL constraint is difficult because the granularity of the true clustering is unknown. Our main contribution is an efficient algorithm for learning a kernel matrix using the log-determinant divergence (a variant of the Bregman divergence) subject to a set of relative distance constraints. Given the learned kernel matrix, a clustering can be obtained by any suitable algorithm, such as kernel k-means. We show empirically that kernels found by our algorithm yield clusterings of higher quality than existing approaches that either use ML/CL constraints or a different means to implement the supervision using relative comparisons.
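
For concreteness, a relative comparison "i is closer to j than to k" becomes a constraint on kernel-induced squared distances, d2(i,j) = K_ii + K_jj - 2 K_ij <= d2(i,k). A minimal constraint check in Python, with the log-det learning of K itself omitted:

    import numpy as np

    def kernel_sqdist(K, i, j):
        # Squared distance in the feature space induced by kernel matrix K.
        return K[i, i] + K[j, j] - 2.0 * K[i, j]

    def satisfies(K, i, j, k):
        # True if the comparison "i closer to j than to k" holds under K.
        return kernel_sqdist(K, i, j) <= kernel_sqdist(K, i, k)

    K = np.eye(4) + 0.1  # a toy positive semi-definite kernel matrix
    print(satisfies(K, 0, 1, 2))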

Scientific Track
Authors:

Tinh Tran, Alex Aussem

Affiliation(s):
University Lyon 1

Abstract:
Covariate shift is a specific class of selection bias that arises when the marginal distributions of the input features X are different in the source and the target domains while the conditional distributions of the target Y given X are the same. A common technique to deal with this problem, called importance weighting, amounts to reweighting the training instances in order to make them resemble the test distribution. However, this usually comes at the expense of a reduction of the effective sample size. In this paper, we show analytically that, while the unweighted model is globally more biased than the weighted one, it may locally be less biased on low importance instances. In view of this result, we then discuss how to optimally combine the weighted and the unweighted models in order to improve the predictive performance in the target domain. We conduct a series of experiments on synthetic and real-world data to demonstrate the efficiency of this approach.
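
A minimal sketch of plain importance weighting, estimating w(x) = p_target(x)/p_source(x) with a domain classifier and Bayes' rule; the paper's optimal combination of weighted and unweighted models is not reproduced here.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    X_src = rng.normal(0.0, 1.0, size=(500, 2))   # source inputs
    X_tgt = rng.normal(0.7, 1.0, size=(500, 2))   # shifted target inputs
    y_src = (X_src.sum(axis=1) > 0).astype(int)   # labels only in source

    # Discriminate source (0) from target (1); the odds estimate the ratio.
    dom = LogisticRegression().fit(np.vstack([X_src, X_tgt]),
                                   np.r_[np.zeros(500), np.ones(500)])
    p = dom.predict_proba(X_src)[:, 1]
    w = p / (1.0 - p)

    weighted = LogisticRegression().fit(X_src, y_src, sample_weight=w)
    unweighted = LogisticRegression().fit(X_src, y_src)
    print(weighted.coef_, unweighted.coef_)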

Scientific Track
Authors:
Zhanxing Zhu, University of Edinburgh
Amos Storkey, University of Edinburgh

Abstract:
We consider a generic convex-concave saddle point problem with a separable structure, a form that covers a wide range of machine learning applications. Under this problem structure, we follow the framework of primal-dual updates for saddle point problems, and incorporate stochastic block coordinate descent with adaptive stepsizes into this framework. We theoretically show that our proposal of adaptive stepsizes potentially achieves a sharper linear convergence rate compared with the existing methods. Additionally, since we can select "mini-batches" of block coordinates to update, our method is also amenable to parallel processing for large-scale data. We apply the proposed method to regularized empirical risk minimization and show that it performs comparably or, more often, better than state-of-the-art methods on both synthetic and real-world data sets.

Scientific Track
Authors:
Sebastian Wagner, Ludwig-Maximilians-University
Max Zimmermann, University of Magdeburg
Eirini Ntoutsi, Ludwig-Maximilians-University Munich
Myra Spiliopoulou, University of Magdeburg

Abstract:
The long-term analysis of opinionated streams requires algorithms that predict the polarity of opinionated documents, while adapting to different forms of concept drift: the class distribution may change, but also the vocabulary used by the document authors may change. One of the key properties of a stream classifier is adaptation to concept drifts and shifts; this is typically achieved through ageing of the data. Surprisingly, for one of the most popular classifiers, Multinomial Naive Bayes (MNB), no ageing has been considered thus far. MNB is particularly appropriate for opinionated streams, because it allows the seamless adjustment of word probabilities as new words appear for the first time. However, to adapt properly to drift, MNB must also be extended to take the age of documents and words into account. In this study, we incorporate ageing into the learning process of MNB by introducing the notion of fading for words, on the basis of the recency of the documents containing them. We propose two fading versions, gradual fading and aggressive fading, of which the latter discards old data at a faster pace. Our experiments with Twitter data show that the ageing-based MNBs outperform the standard accumulative MNB approach and manage to recover very fast in times of change. We experiment with different data granularities in the stream and different degrees of data ageing, and we show how they "work together" towards adaptation to change.
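
A toy sketch of gradual fading for a streaming MNB: all counts decay by a constant factor before each new batch is absorbed, so old vocabulary gradually loses weight. The decay schedule and smoothing are illustrative, not the paper's exact scheme.

    from collections import defaultdict
    import math

    class FadingMNB:
        def __init__(self, fade=0.9, alpha=1.0):
            self.fade, self.alpha = fade, alpha
            self.word_counts = {c: defaultdict(float) for c in (0, 1)}
            self.class_counts = {c: 0.0 for c in (0, 1)}

        def partial_fit(self, docs, labels):
            for c in self.word_counts:  # age all counts before the new batch
                self.class_counts[c] *= self.fade
                for w in self.word_counts[c]:
                    self.word_counts[c][w] *= self.fade
            for doc, y in zip(docs, labels):
                self.class_counts[y] += 1.0
                for w in doc.split():
                    self.word_counts[y][w] += 1.0

        def predict(self, doc):
            vocab = set(self.word_counts[0]) | set(self.word_counts[1])
            scores = {}
            for c in (0, 1):
                total = sum(self.word_counts[c].values())
                s = math.log(self.class_counts[c] + self.alpha)
                for w in doc.split():
                    s += math.log((self.word_counts[c][w] + self.alpha)
                                  / (total + self.alpha * len(vocab)))
                scores[c] = s
            return max(scores, key=scores.get)

    clf = FadingMNB()
    clf.partial_fit(["good great", "bad awful"], [1, 0])
    print(clf.predict("great"))  # expected: 1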

Scientific Track
Authors:
Amos Storkey, University of Edinburgh
Zhanxing Zhu, University of Edinburgh
Jinli Hu, University of Edinburgh

Abstract:
Trading in information markets, such as machine learning markets, has been shown to be an effective approach for aggregating the beliefs of different agents. In a machine learning context, aggregation commonly uses forms of linear opinion pools or log opinion pools. It is interesting to relate information market aggregation to the machine learning setting. In this paper we introduce a spectrum of compositional methods, Renyi divergence aggregators, that interpolate between log opinion pools and linear opinion pools. We show that these compositional methods are maximum entropy distributions for aggregating information from agents subject to individual biases, with the Renyi divergence parameter dependent on the bias. In the limit of no bias this reduces to the optimal limit of log opinion pools. We demonstrate this relationship practically on both simulated and real datasets. We then return to information markets and show that Renyi divergence aggregators are directly implemented by machine learning markets with isoelastic utilities, and so can result from autonomous, self-interested decision making by individuals contributing different predictors. The risk averseness of the isoelastic utility directly relates to the Renyi divergence parameter, and hence encodes how much an agent believes (s)he may be subject to an individual bias that could affect the trading outcome: if an agent believes (s)he might be acting on significantly biased information, a more risk averse isoelastic utility is warranted.
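
One way to picture the interpolation is a weighted power-mean pool, which recovers the linear pool at gamma = 1 and the log (geometric) pool as gamma approaches 0. The sketch illustrates only the interpolation idea and is not claimed to be the paper's exact Renyi aggregator.

    import numpy as np

    def power_mean_pool(probs, weights, gamma):
        """probs: (agents, outcomes) array of predictive distributions."""
        probs = np.asarray(probs, dtype=float)
        w = np.asarray(weights, dtype=float)[:, None]
        if abs(gamma) < 1e-8:  # limiting case: weighted geometric mean
            pooled = np.exp((w * np.log(probs)).sum(axis=0))
        else:
            pooled = ((w * probs ** gamma).sum(axis=0)) ** (1.0 / gamma)
        return pooled / pooled.sum()  # renormalize to a distribution

    agents = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
    print(power_mean_pool(agents, [0.5, 0.5], gamma=1.0))   # linear pool
    print(power_mean_pool(agents, [0.5, 0.5], gamma=1e-9))  # ~log pool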

Scientific Track
Authors:
Jiwoong Im, University of Guelph
Ethan Buchman, University of Guelph
Graham Taylor, University of Guelph

Abstract:
Energy-based models are popular in machine learning due to the elegance of their formulation and their relationship to statistical physics. Among these, the Restricted Boltzmann Machine (RBM), and its staple training algorithm contrastive divergence (CD), have been the prototype for some recent advancements in the unsupervised training of deep neural networks. However, CD has limited theoretical motivation, and can in some cases produce undesirable behaviour. Here, we investigate the performance of Minimum Probability Flow (MPF) learning for training RBMs. Unlike CD, with its focus on approximating an intractable partition function via Gibbs sampling, MPF proposes a tractable, consistent objective function defined in terms of a Taylor expansion of the KL divergence with respect to sampling dynamics. We propose a more general form for the sampling dynamics in MPF, and explore the consequences of different choices for these dynamics for training RBMs. Experimental results show MPF outperforming CD for various RBM configurations.
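
For intuition, the MPF objective under single-bit-flip connectivity can be written in a few lines. The sketch uses a toy fully visible binary model rather than an RBM, purely to keep the example short.

    import numpy as np

    def energy(x, W, b):
        return -0.5 * x @ W @ x - b @ x

    def mpf_objective(X, W, b):
        # K(theta) ~ mean over data x of sum over one-bit-flip neighbors x'
        # of exp((E(x) - E(x')) / 2).
        total = 0.0
        for x in X:
            for i in range(len(x)):
                xp = x.copy()
                xp[i] = 1 - xp[i]
                total += np.exp(0.5 * (energy(x, W, b) - energy(xp, W, b)))
        return total / len(X)

    rng = np.random.default_rng(6)
    X = rng.integers(0, 2, size=(10, 5)).astype(float)
    W = rng.normal(size=(5, 5)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
    print(mpf_objective(X, W, rng.normal(size=5)))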

Data Mining and Knowledge Discovery
Authors:
Vasileios Lampos, University College London
Elad Yom-Tov, Microsoft Research
Richard Pebody, Public Health England
Ingemar J. Cox, University of Copenhagen

Abstract:
Assessing the effect of a health-oriented intervention by traditional epidemiological methods is commonly based only on population segments that use healthcare services. Here we introduce a complementary framework for evaluating the impact of a targeted intervention, such as a vaccination campaign against an infectious disease, through a statistical analysis of user-generated content submitted on web platforms. Using supervised learning, we derive a nonlinear regression model for estimating the prevalence of a health event in a population from Internet data. This model is applied to identify control location groups that correlate historically with the areas where a specific intervention campaign has taken place. We then determine the impact of the intervention by inferring a projection of the disease rates that could have emerged in the absence of a campaign. Our case study focuses on the influenza vaccination program that was launched in England during the 2013/14 season, and our observations consist of millions of geo-located search queries to the Bing search engine and posts on Twitter. The impact estimates derived from the application of the proposed statistical framework support conventional assessments of the campaign.

Industrial Track
Authors:
Enda Barrett, Schneider Electric
Stephen Linder, Schneider Electric

Abstract:
Recent high profile developments of autonomous learning thermostats by companies such as Nest Labs and Honeywell have brought to the fore the possibility of ever greater numbers of intelligent devices permeating our homes and working environments in the future. However, the specific learning approaches and methodologies utilised by these devices have never been made public. In fact, little information is known as to the specifics of how these devices operate and learn about their environments or the users who use them. This paper proposes a suitable learning architecture for such an intelligent thermostat, in the hope that it will benefit further investigation by the research community. Our architecture comprises a number of different learning methods, each of which contributes to creating a complete autonomous thermostat capable of controlling an HVAC system. A novel state-action space formalism is proposed to enable a Reinforcement Learning agent to successfully control the HVAC system by optimising both occupant comfort and energy costs. Our results show that the learning thermostat can achieve cost savings of 10% over a programmable thermostat, whilst maintaining high occupant comfort standards.
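
As a generic illustration of the reinforcement learning ingredient (the paper's state-action formalism is not spelled out in this abstract), a toy tabular Q-learning loop for a thermostat-like task could look as follows; the dynamics and reward are invented placeholders.

    import random
    from collections import defaultdict

    Q = defaultdict(float)
    alpha, gamma, eps = 0.1, 0.95, 0.1
    actions = ["heat_on", "heat_off"]

    def reward(temp, action):
        comfort = -abs(temp - 21.0)  # penalty for deviating from 21 C
        energy = -0.5 if action == "heat_on" else 0.0
        return comfort + energy

    def step(temp, action):  # crude stand-in for room dynamics
        return temp + (0.5 if action == "heat_on" else -0.3)

    temp = 18.0
    for _ in range(10000):
        s = round(temp)
        a = random.choice(actions) if random.random() < eps \
            else max(actions, key=lambda act: Q[(s, act)])
        temp = step(temp, a)
        r, s2 = reward(temp, a), round(temp)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, act)] for act in actions)
                              - Q[(s, a)])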

Scientific Track
Authors:
Yuanli Pei, Oregon State University
Li-Ping Liu, Oregon State University
Xiaoli Fern, Oregon State University

Abstract:
Clustering can be improved with pairwise constraints that specify similarities between pairs of instances. However, randomly selecting constraints could lead to wasted labeling effort, or even degrade the clustering performance. Consequently, how to actively select effective pairwise constraints to improve clustering becomes an important problem, which is the focus of this paper. In this work, we introduce a Bayesian clustering model that learns from pairwise constraints. With this model, we present an active learning framework that iteratively selects the most informative pair of instances to query an oracle, and updates the model posterior based on the obtained pairwise constraints. We introduce two information-theoretic criteria for selecting informative pairs. One selects the pair with the most uncertainty, and the other chooses the pair that maximizes the marginal information gain about the clustering. Experiments on benchmark datasets demonstrate the effectiveness of the proposed method over state-of-the-art approaches.

Nectar Track
Authors:

Giorgio Corani, Alessio Benavoli, Francesca Mangili, Marco Zaffalon

Affiliation(s):
IDSIA

Abstract:
Most hypothesis testing in machine learning is done using the frequentist null-hypothesis significance test, which has severe drawbacks. We review recent Bayesian tests which overcome the drawbacks of the frequentist ones.

Scientific Track
Authors:

Tom Diethe, Niall Twomey, Peter Flach

Affiliation(s):
University of Bristol

Abstract:
Typically, when analysing patterns of activity in a smart home environment, the daily patterns of activity are either ignored completely or summarised into a high-level "hour-of-day" feature that is then combined with sensor activities. However, when summarising the temporal nature of an activity into a coarse feature such as this, not only is information lost after discretisation, but also the strength of the periodicity of the action is ignored. We propose to model the temporal nature of activities using circular statistics, and in particular by performing Bayesian inference with Wrapped Normal (WN) and WN Mixture (WNM) models. We firstly demonstrate the accuracy of inference on toy data using both Gibbs sampling and Expectation Propagation (EP), and then show the results of the inference on publicly available smart-home data. Such models can be useful for analysis or prediction in their own right, or can be readily combined with larger models incorporating multiple modalities of sensor activity.
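
A Wrapped Normal density over hour-of-day (period 24) can be evaluated by truncating the wrapping sum, as in this small sketch; the Gibbs/EP inference machinery from the paper is not shown.

    import numpy as np

    def wrapped_normal_pdf(theta, mu, sigma, period=24.0, k_max=5):
        # Sum Gaussian copies shifted by whole periods (truncated at k_max).
        ks = np.arange(-k_max, k_max + 1)
        diffs = np.asarray(theta)[..., None] - mu + ks * period
        return (np.exp(-0.5 * (diffs / sigma) ** 2)
                / (sigma * np.sqrt(2 * np.pi))).sum(axis=-1)

    hours = np.array([23.5, 0.2, 1.0])
    print(wrapped_normal_pdf(hours, mu=0.0, sigma=1.5))  # wraps past midnight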

Data Mining and Knowledge Discovery
Authors:

Eric Malmi, Nikolaj Tatti, Aristides Gionis

Affiliation(s):
Aalto University, Espoo, Finland

Abstract:
Defining appropriate distance measures among rankings is a classic area of study which has led to many useful applications. In this paper, we propose a more general abstraction of preference data, namely directed acyclic graphs (DAGs), and introduce a measure for comparing DAGs, given that a vertex correspondence between the DAGs is known. We study the properties of this measure and use it to aggregate and cluster a set of DAGs. We show that these problems are NP-hard and present efficient methods to obtain solutions with approximation guarantees. In addition to preference data, these methods turn out to have other interesting applications, such as the analysis of a collection of information cascades in a network. We test the methods on synthetic and real-world datasets, showing that the methods can be used to, e.g., find a set of influential individuals related to a set of topics in a network or to discover meaningful and occasionally surprising clustering structure.

Scientific Track
Authors:
Nipa Chowdhury, UNSW
Xiongcai Cai, UNSW
Cheng Luo, UNSW

Abstract:
Personalised recommender systems are widely used for information filtering in information retrieval, where matrix factorisation (MF) has become popular as a model-based approach to personalised recommendation. Classical MF methods, which directly approximate low-rank factor matrices by minimising some rating-prediction criterion, do not achieve satisfactory performance for the task of top-N recommendation. In this paper, we propose a novel MF method, namely BoostMF, that formulates factorisation as a learning problem and integrates boosting into factorisation. Rather than using boosting as a wrapper, BoostMF directly learns latent factors that are optimised toward top-N recommendation. The proposed method is evaluated against a set of state-of-the-art methods on three popular public benchmark datasets. The experimental results demonstrate that the proposed method achieves significant improvement over these baseline methods for the task of top-N recommendation.

Demo
Authors:

Andre Lourenco, Ana Priscila Alves, Carlos Carreiras, Rui Duarte, Ana Fred

Affiliation(s):
Instituto de Telecomunicações
CardioID - Technologies LDA

Abstract:
Monitoring of physiological signals while driving is a recent trend in the automotive industry. CardioWheel is a state-of-the-art machine learning solution for driver biometrics based on electrocardiographic signals (ECG). The system pervasively acquires heart signals from the hands of the user through sensors embedded in the steering wheel, allowing the recognition of the driver's identity. The implemented pipeline combines unsupervised and supervised machine learning algorithms, and is being tested in real-world scenarios, illustrating one of the potential uses of this technology.

Data Mining and Knowledge Discovery
Authors:

Saskia Metzler, Pauli Miettinen

Affiliation(s):
Max-Planck-Institut für Informatik

Abstract:
Graphs - such as friendship networks - that evolve over time are an example of data that are naturally represented as binary tensors. Similarly to analysing the adjacency matrix of a graph using a matrix factorization, we can analyse the tensor by factorizing it. Unfortunately, tensor factorizations are computationally hard problems, and in particular are often significantly harder than their matrix counterparts. In the case of Boolean tensor factorizations - where the input tensor and all the factors are required to be binary and we use Boolean algebra - much of that hardness comes from the possibility of overlapping components. Yet, in many applications we are perfectly happy to partition at least one of the modes. For instance, in the aforementioned time-evolving friendship networks, groups of friends might be overlapping, but the time points at which the network was captured are always distinct. In this paper we investigate what consequences this partitioning has on the computational complexity of Boolean tensor factorizations and present a new algorithm for the resulting clustering problem. This algorithm can alternatively be seen as a particularly regularized clustering algorithm that can handle extremely high-dimensional observations. We analyse our algorithm with the goal of maximizing the similarity and argue that this is more meaningful than minimizing the dissimilarity. As a by-product we obtain a PTAS and an efficient 0.828-approximation algorithm for rank-1 binary factorizations. Our algorithm for Boolean tensor clustering achieves high scalability, high similarity, and good generalization to unseen data on both synthetic and real-world data sets.

Industrial Track
Authors:

George Forman, Hila Nachlieli, Renato Keshet

Affiliation(s):
Hewlett-Packard Labs

Abstract:
Our business users have often been frustrated with clustering results that do not suit their purpose; when trying to discover clusters of product complaints, the algorithm may return clusters of product models instead. The fundamental issue is that complex text data can be clustered in many different ways, and, really, it is optimistic to expect relevant clusters from an unsupervised process, even with parameter tinkering. We studied this problem in an interactive context and developed an effective solution that re-casts the problem formulation, radically different from traditional or semi-supervised clustering. Given training labels of some known classes, our method incrementally proposes complementary clusters. In tests on various business datasets, we consistently get relevant results at interactive time scales. This paper describes the method and demonstrates its superior ability using publicly available datasets. For automated evaluation, we devised a unique cluster evaluation framework to match the business user's utility.

Scientific Track
Authors:
Debakar Shamanta, University of Texas at El Paso
Sheikh Motahar Naim, University of Texas at El Paso
Parang Saraf, Virginia Tech
Naren Ramakrishnan, Virginia Tech
M. Shahriar Hossain, University of Texas at El Paso

Abstract:
Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representations of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representations of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime and without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics capture intuitive themes using fewer dimensions than conventional topic modeling approaches.

Scientific Track
Authors:
Markus Ring, HS Coburg
Florian Otto, HS Coburg
Martin Becker, University of Würzburg
Thomas Niebler, University of Würzburg
Dieter Landes, HS Coburg
Andreas Hotho, University of Kassel

Abstract:
A distance measure between objects is a key requirement for many data mining tasks like clustering, classification or outlier detection. However, for objects described by categorical attributes, defining meaningful distance measures is a challenging task, since the values within such attributes have no inherent order, especially without additional domain knowledge. In this paper, we propose an unsupervised distance measure for objects with categorical attributes based on the idea that categorical attribute values are similar if they appear with similar value distributions on correlated context attributes. Thus, the distance measure is automatically derived from the given data set. We compare our new distance measure to existing categorical distance measures and evaluate it on different data sets from the UCI machine learning repository. The experiments show that our distance measure is preferable, since it achieves similar or better results in a more robust way than previous approaches.
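
The core idea can be illustrated by comparing the conditional distributions that two values induce on a context attribute, here with a total-variation distance; the paper's actual derivation differs in detail.

    import numpy as np
    import pandas as pd

    def value_distance(df, attr, v1, v2, context):
        p1 = df.loc[df[attr] == v1, context].value_counts(normalize=True)
        p2 = df.loc[df[attr] == v2, context].value_counts(normalize=True)
        support = p1.index.union(p2.index)
        # Total-variation distance between the conditional distributions.
        return 0.5 * np.abs(p1.reindex(support, fill_value=0)
                            - p2.reindex(support, fill_value=0)).sum()

    df = pd.DataFrame({"color": ["red", "red", "blue", "blue", "green"],
                       "size":  ["S",   "S",   "L",    "L",    "S"]})
    print(value_distance(df, "color", "red", "blue", context="size"))  # 1.0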

Machine Learning
Authors:

Cong Leng, Jian Cheng

Affiliation(s):
National Natural Science Foundation of China

Abstract:
Hashing techniques have been widely used in many machine learning applications because of their efficiency in both computation and storage. Although a variety of hashing methods have been proposed, most of them make some implicit assumptions about the statistical or geometrical structure of data. In fact, few hashing algorithms can adequately handle all kinds of data with different structures. When considering hybrid-structure datasets, different hashing algorithms might produce different and possibly inconsistent binary codes. Inspired by the successes of classifier combination and clustering ensembles, in this paper we present a novel combination strategy for multiple hashing results, named Consensus Hashing (CH). By defining the measure of consensus of two hashing results, we put forward a simple yet effective model to learn consensus hash functions which generate binary codes consistent with the existing ones. Extensive experiments on several large-scale benchmarks demonstrate the overall superiority of the proposed method compared with state-of-the-art hashing algorithms.

Scientific Track
Authors:
Mathieu Blondel, NTT CS labs
Akinori Fujino, NTT CS labs
Naonori Ueda, NTT CS labs

Abstract:
Factorization machines are a generic framework which allows one to mimic most factorization models simply by feature engineering. In this way, they combine the high predictive accuracy of factorization models with the flexibility of feature engineering. Unfortunately, factorization machines involve a non-convex optimization problem. Thus, we can only obtain a local solution, the quality of which depends on initialization. In this paper, we propose a convex formulation of factorization machines based on the nuclear norm. Our formulation imposes fewer restrictions on the learned model and is thus more general than the original formulation. To solve the corresponding optimization problem, we present an efficient globally-convergent two-block coordinate descent algorithm. Empirically, we demonstrate that our approach achieves comparable or better predictive accuracy than the original factorization machines on 4 recommendation tasks and scales to datasets with 10 million samples.
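
For reference, the standard second-order factorization machine prediction can be computed in O(dk) time via a well-known algebraic identity; this sketch shows the original (non-convex) model, not the paper's convex reformulation.

    import numpy as np

    def fm_predict(x, w0, w, V):
        # sum_{i<j} <v_i, v_j> x_i x_j
        #   = 0.5 * sum_f ((sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2)
        s = V.T @ x
        s2 = (V ** 2).T @ (x ** 2)
        return w0 + w @ x + 0.5 * np.sum(s ** 2 - s2)

    rng = np.random.default_rng(3)
    x, w, V = rng.random(6), rng.normal(size=6), rng.normal(size=(6, 4))
    print(fm_predict(x, 0.1, w, V))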

Machine Learning
Authors:

Eugene Belilovsky, Andreas Argyriou, Gael Varoquaux, Matthew B. Blaschko

Affiliation(s):
École Centrale Paris
INRIA, France

Abstract:
We study the problem of statistical estimation with a signal known to be sparse, spatially contiguous, and containing many highly correlated variables. We take inspiration from the recently introduced k-support norm, which has been successfully applied to sparse prediction problems with correlated features, but lacks any explicit structural constraints commonly found in machine learning and image processing. We address this problem by incorporating a total variation penalty in the k-support framework. We introduce the (k,s) support total variation norm as the tightest convex relaxation of the intersection of a set of sparsity and total variation constraints. We show that this norm leads to an intractable combinatorial graph optimization problem, which we prove to be NP-hard. We then introduce a tractable relaxation with approximation guarantees that scale well for grid-structured graphs. We devise several first-order optimization strategies for statistical parameter estimation with the described penalty. We demonstrate the effectiveness of this penalty on classification in the low-sample regime, classification with M/EEG neuroimaging data, and background-subtracted image recovery with synthetic and real data. We extensively analyse the application of our penalty to the complex task of identifying predictive regions from low-sample, high-dimensional fMRI brain data, and show that our method is particularly useful compared to existing methods in terms of accuracy, interpretability, and stability.

Industrial Track
Authors:
Romain Guigourès, Zalando
Dominique Gay, Orange
Boullé Marc, Orange Labs
Fabrice Clérot, Orange Labs
Fabrice Rossi, SAMM EA 4543

Abstract:
Call Detail Records (CDRs) are data recorded by telecommunications companies, consisting of basic information related to several dimensions of the calls made through the network: the source, destination, date and time of calls. CDR data analysis has received much attention in recent years since it might reveal valuable information about human behavior. It has shown high added value in many application domains, e.g., community analysis or network planning. In this paper, we suggest a generic methodology based on data grid models for summarizing the information contained in CDR data. The method is based on a parameter-free estimation of the joint distribution of the variables that describe the calls. We also suggest several well-founded criteria that allow one to browse the summary at various granularities and to explore it by means of insightful visualizations. The method handles network graph data, temporal sequence data as well as user mobility data stemming from original CDR data. We show the relevance of our methodology on real-world CDR data from Ivory Coast for various case studies, such as network planning strategy and yield management pricing strategy.

Demo
Authors:

Ilaria Tiddi, Mathieu d'Aquin, Enrico Motta

Affiliation(s):
Knowledge Media Institute

Abstract:
In this paper we present the system Dedalo, whose aim is to generate explanations for data patterns using background knowledge retrieved from Linked Data. In many real-world scenarios, patterns are generally interpreted manually by experts, who have to use their own background knowledge to explain and refine them, while their workload could be relieved by exploiting the open and machine-readable knowledge existing on the Web nowadays. In light of this, we devised an automatic system that, given some patterns and some background knowledge extracted from Linked Data, reasons upon those and creates well-structured candidate explanations for their grouping. In our demo, we show how the system provides a step towards automatising the interpretation process in KDD, by presenting scenarios in different domains, data and patterns.

Scientific Track
Authors:
Vikas Raykar, IBM Research
Amrita Saha, IBM Research

Abstract:
A conventional textbook prescription for building good predictive models is to split the data into three parts: training set (for model fitting), validation set (for model selection), and test set (for final model assessment). Predictive models can potentially evolve over time as developers improve their performance either by acquiring new data or improving the existing model. The main contribution of this paper is to discuss problems encountered and propose workflows to manage the allocation of newly acquired data into different sets in such dynamic model building and updating scenarios. Specifically, we propose three different workflows (parallel dump, serial waterfall, and hybrid) for allocating new data into the existing training, validation, and test splits. Particular emphasis is placed on avoiding the bias due to the repeated use of the existing validation or test set.
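
A minimal sketch of the parallel-dump idea as described: each newly acquired batch is split across the three existing sets in fixed proportions, so all three grow together (the proportions below are illustrative).

    import numpy as np

    def parallel_dump(new_data, rng, p=(0.6, 0.2, 0.2)):
        idx = rng.permutation(len(new_data))
        n_tr, n_va = int(p[0] * len(idx)), int(p[1] * len(idx))
        return (new_data[idx[:n_tr]],               # goes to training
                new_data[idx[n_tr:n_tr + n_va]],    # goes to validation
                new_data[idx[n_tr + n_va:]])        # goes to test

    rng = np.random.default_rng(4)
    train_add, val_add, test_add = parallel_dump(np.arange(100), rng)
    print(len(train_add), len(val_add), len(test_add))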

Nectar Track
Authors:

Harald Bosch, Robert Krüger, Dennis Thom

Affiliation(s):
Institut für Visualisierung und Interaktive Systeme (VIS)

Abstract:
Geolocated social media data streams are challenging data sources due to volume, velocity, variety, and unorthodox vocabulary. However, they also are an unrivaled source of eye-witness accounts to establish remote situational awareness. In this paper we summarize some of our approaches to separate relevant information from irrelevant chatter using unsupervised and supervised methods alike. This allows the structuring of requested information as well as the incorporation of unexpected events into a common overview of the situation. A special focus is put on the interplay of algorithms, visualization, and interaction.

Scientific Track
Authors:

Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, Yoshua Bengio

Affiliation(s):
Université de Montréal

Abstract:
Back-propagation has been the workhorse of recent successes of deep learning but it relies on infinitesimal effects (partial derivatives) in order to perform credit assignment. This could become a serious issue as one considers deeper and more non-linear functions, e.g., consider the extreme case of nonlinearity where the relation between parameters and cost is actually discrete. Inspired by the biological implausibility of back-propagation, a few approaches have been proposed in the past that could play a similar credit assignment role. In this spirit, we explore a novel approach to credit assignment in deep networks that we call target propagation. The main idea is to compute targets rather than gradients, at each layer. Like gradients, they are propagated backwards. In a way that is related but different from previously proposed proxies for back-propagation which rely on a backwards network with symmetric weights, target propagation relies on auto-encoders at each layer. Unlike back-propagation, it can be applied even when units exchange stochastic bits rather than real numbers. We show that a linear correction for the imperfectness of the auto-encoders, called difference target propagation, is very effective to make target propagation actually work, leading to results comparable to back-propagation for deep networks with discrete and continuous units and denoising auto-encoders, and achieving state of the art for stochastic networks.
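
The difference correction can be stated in one line: the target for a lower layer is t_lower = h_lower + g(t_upper) - g(h_upper), where g is the layer's learned approximate inverse. A toy numerical sketch with stand-in functions:

    import numpy as np

    def f(h, W):  # forward computation of a layer
        return np.tanh(W @ h)

    def g(h, V):  # learned approximate inverse (decoder of an auto-encoder)
        return np.tanh(V @ h)

    rng = np.random.default_rng(7)
    W, V = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
    h_lower = rng.normal(size=4)
    h_upper = f(h_lower, W)
    t_upper = h_upper - 0.1 * rng.normal(size=4)  # target from the layer above
    # The difference term cancels the auto-encoder's reconstruction error.
    t_lower = h_lower + g(t_upper, V) - g(h_upper, V)
    print(np.round(t_lower, 3))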

Scientific Track
Authors:
Rina Okada, University of Tsukuba
Kazuto Fukuchi, University of Tsukuba
Jun Sakuma, University of Tsukuba

Abstract:
This paper presents an investigation of differentially private analysis of distance-based outliers. Outlier detection aims to identify instances that are apparently distant from other instances. Meanwhile, the objective of differential privacy is to conceal the presence (or absence) of any particular instance. Outlier detection and privacy protection are therefore intrinsically conflicting tasks. In this paper, we present differentially private queries for counting outliers that appear in a given subspace, instead of reporting the outliers detected. Our analysis of the global sensitivity of outlier counts reveals that regular global sensitivity-based methods can make the outputs too noisy, particularly when the dimensionality of the given subspace is high. Noting that the counts of outliers are typically expected to be small compared to the size of the data, we introduce a mechanism based on the smooth upper bound of the local sensitivity. This study is the first attempt to ensure differential privacy for distance-based outlier analysis. The experimentally obtained results show that our method achieves better utility than global sensitivity-based methods do.
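
For contrast, the global-sensitivity baseline is the plain Laplace mechanism: add noise with scale sensitivity/epsilon to the true count. The paper's point is that the global sensitivity of outlier counts can be large, motivating the smooth local-sensitivity bound; only the generic mechanism is sketched here.

    import numpy as np

    def laplace_count(true_count, sensitivity, epsilon, rng):
        # Releases a differentially private count under the given sensitivity.
        return true_count + rng.laplace(scale=sensitivity / epsilon)

    rng = np.random.default_rng(5)
    print(laplace_count(true_count=7, sensitivity=1.0, epsilon=0.5, rng=rng))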

Machine Learning
Rollover 2014
Authors:
Motoki Shiga, Gifu University
Voot Tangkaratt, Tokyo Institute of Technology
Masashi Sugiyama, Tokyo Institute of Technology

Abstract:
Regression is a fundamental problem in statistical data analysis, which aims at estimating the conditional mean of output given input. However, regression is not informative enough if the conditional probability density is multi-modal, asymmetric, and heteroscedastic. To overcome this limitation, various estimators of conditional densities themselves have been developed, and a kernel-based approach called least-squares conditional density estimation (LS-CDE) was demonstrated to be promising. However, LS-CDE still suffers from large estimation error if the input contains many irrelevant features. In this paper, we therefore propose an extension of LS-CDE called sparse additive CDE (SA-CDE), which allows automatic feature selection in CDE. SA-CDE applies kernel LS-CDE to each input feature in an additive manner and penalizes the whole solution by a group-sparse regularizer. We also give a subgradient-based optimization method for SA-CDE training that scales well to high-dimensional large data sets. Through experiments with benchmark and humanoid robot transition datasets, we demonstrate the usefulness of SA-CDE in noisy CDE problems.

Scientific Track
Authors:
Shuyang Lin, UIC
Qingbo Hu, www.uic.edu
Jingyuan Zhang, www.uic.edu
Philip Yu, UIC

Abstract:
Recently, user influence in social networks has been studied extensively. Many applications related to social influence depend on quantifying influence and finding the most influential users of a social network. Most existing work studies the global influence of users, i.e. the aggregated influence that a user has on the entire network. It is often overlooked that users may be significantly more influential to some audience groups than others. In this paper, we propose AudClus, a method to detect audience groups and identify group-specific influencers simultaneously. With extensive experiments on real data, we show that AudClus is effective in both the task of detecting audience groups and the task of identifying influencers of audience groups. We further show that AudClus enables insightful observations on the relation between audience groups and influencers. The proposed method leads to various applications in areas such as viral marketing, expert finding, and data visualization.

Nectar Track
Authors:

Mathis Boerner, Tim Ruhe, Katharina Morik, Wolfgang Rhode

Affiliation(s):
TU Dortmund University

Abstract:
Astrophysical experiments produce Big Data which need efficient and effective data analytics. In this paper we present a general data analysis process which has been successfully applied to data from IceCube, a cubic-kilometer neutrino detector located at the geographic South Pole. The goal of the analysis is to separate neutrinos from the background within the data in order to determine the muon neutrino energy spectrum. The presented process covers straight cuts, feature selection, classification, and unfolding. A major challenge in the separation is the unbalanced dataset: the expected signal-to-background ratio was worse than 1:1000 and, moreover, any surviving background would hinder further analysis of the data. The overall process was embedded in a multi-fold cross-validation to control its performance. A subsequent regularized unfolding yields the sought-after energy spectrum.

Scientific Track
Authors:
Junting Ye, Stony Brook University
Leman Akoglu, Stony Brook University

Abstract:
Online reviews are an important source for consumers to evaluate products/services on the Internet (e.g. Amazon, Yelp, etc.). However, more and more fraudulent reviewers write fake reviews to mislead users. To maximize their impact and share effort, many spam attacks are organized as campaigns by a group of spammers. In this paper, we propose a new two-step method to discover spammer groups and their targeted products. First, we introduce NFS (Network Footprint Score), a new measure that quantifies the likelihood of products being spam campaign targets. Second, we carefully devise GroupStrainer to cluster spammers on a 2-hop subgraph induced by top-ranking products. We demonstrate the efficiency and effectiveness of our approach on both synthetic and real-world datasets from two different domains with millions of products and reviewers. Moreover, we discover interesting strategies that spammers employ through case studies of our detected groups.

Scientific Track
Authors:

Rana Haber, Anand Rangarajan, Adrian Peter

Affiliation(s):
Florida Institute of Technology
University of Florida

Abstract:
The modus operandi for machine learning is to represent data as feature vectors and then proceed with training algorithms that seek to optimally partition the feature space S ⊂ R^n into labeled regions. This holds true even when the original data are functional in nature, i.e. curves or surfaces that are inherently varying over a continuum such as time or space. Functional data are often reduced to summary statistics, locally-sensitive characteristics, and global signatures with the objective of building comprehensive feature vectors that uniquely characterize each function. The present work directly addresses representational issues of functional data for supervised learning. We propose a novel discriminative interpolation framework wherein functional data in the same class are adaptively reconstructed to be more similar to each other, while simultaneously repelling nearest neighbor functional data in other classes. Akin to other recent nearest-neighbor metric learning paradigms like stochastic k-neighborhood selection and large margin nearest neighbors, our technique uses class-specific representations which gerrymander similar functional data in an appropriate parameter space. Experimental validation on several time series data sets establishes the proposed discriminative interpolation framework as competitive with, or better than, recent state-of-the-art techniques which continue to rely on the standard feature vector representation.

Data Mining and Knowledge Discovery
Authors:

Alexios Kotsifakos, Alexandra Stefan, Vassilis Athitsos, Gautam Das, Panagiotis Papapetrou

Affiliation(s):
University of Texas at Arlington
Department of Computer and Systems Sciences, Stockholm University

Abstract:
Similarity search in large sequence databases is a problem ubiquitous in a wide range of application domains, including searching biological sequences. In this paper we focus on protein and DNA data, and we propose a novel approximate method for speeding up range queries under the edit distance. Our method works in a filter-and-refine manner, and its key novelty is a query-sensitive mapping that transforms the original string space to a new string space of reduced dimensionality. Specifically, it first identifies the t most frequent codewords in the query, and then uses these codewords to convert both the query and the database to a more compact representation. This is achieved by replacing every occurrence of each codeword with a new letter and by removing the remaining parts of the strings. Using this new representation, our method identifies a set of candidate matches that are likely to satisfy the range query, and finally refines these candidates in the original space. The main advantage of our method, compared to alternative methods for whole sequence matching under the edit distance, is that it does not require any training to create the mapping, and it can handle large query lengths with negligible losses in accuracy. Our experimental evaluation demonstrates that, for higher range values and large query sizes, our method produces significantly lower costs and runtimes compared to two state-of-the-art competitor methods.

Scientific Track
Authors:
David Huang, The University of Auckland
Yun Sing Koh, University of Auckland
Gillian Dobbie, The University of Auckland
Albert Bifet, Huawei Noah's Ark Lab

Abstract:
Current methods in data streams that detect concept drifts in the underlying distribution of data look at the distribution difference using statistical measures based on mean and variance. Existing methods are unable to proactively approximate the probability of a concept drift occurring and predict future drift points. We extend the current drift detection design by proposing the use of historical drift trends to estimate the probability of expecting a drift at different points across the stream, which we term the expected drift probability. We offer empirical evidence that, by applying our expected drift probability with the state-of-the-art drift detector ADWIN, we can improve the detection performance of ADWIN by significantly reducing the false positive rate. To the best of our knowledge, this is the first work that investigates this idea. We also show that our overall concept can be easily incorporated into incremental classifiers such as VFDT, and demonstrate that the performance of the classifier is further improved.
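
As a back-of-the-envelope illustration of using historical drift trends, one could estimate the chance of a drift within a given horizon from past inter-drift gaps, e.g. under a crude exponential assumption (an illustration, not the paper's estimator).

    import numpy as np

    def expected_drift_probability(drift_points, horizon):
        gaps = np.diff(np.sort(drift_points))  # instances between past drifts
        rate = 1.0 / gaps.mean()               # drifts per instance
        return 1.0 - np.exp(-rate * horizon)

    print(expected_drift_probability([1000, 2100, 2900, 4050], horizon=500))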

Scientific Track
Authors:
Dirk Schäfer, University of Marburg
Eyke Huellermeier, University of Paderborn

Abstract:
Label ranking is a specific type of preference learning problem, namely the problem of learning a model that maps instances to rankings over a finite set of predefined alternatives. These alternatives are identified by their name or label while not being characterized in terms of any properties or features that could be potentially useful for learning. In this paper, we consider a generalization of the label ranking problem that we call dyad ranking. In dyad ranking, not only the instances but also the alternatives are represented in terms of attributes. For learning in the setting of dyad ranking, we propose an extension of an existing label ranking method based on the Plackett-Luce model, a statistical model for rank data. Moreover, we present first experimental results confirming the usefulness of the additional information provided by the feature description of alternatives.
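
For reference, the Plackett-Luce likelihood of an observed ranking given per-alternative scores factorizes into a product of choice probabilities; in dyad ranking the scores become a learned function of joint instance/alternative features, which is omitted in this sketch.

    import numpy as np

    def plackett_luce_loglik(v, ranking):
        """ranking: alternative indices from most to least preferred."""
        ll, remaining = 0.0, list(ranking)
        for r in ranking:
            # Probability of picking r first among the remaining alternatives.
            ll += np.log(v[r]) - np.log(sum(v[i] for i in remaining))
            remaining.remove(r)
        return ll

    v = np.array([3.0, 1.0, 0.5])
    print(plackett_luce_loglik(v, [0, 1, 2]))  # ordering matching the scores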

Data Mining and Knowledge Discovery
Rollover 2014
Authors:
Sarvenaz Choobdar, University of Porto, Faculty of Sciences - Computer Science Department
Pedro Ribeiro, University of Porto, Faculty of Sciences - Computer Science Department
Srinivasan Parthasarathy, The Ohio State University
Fernando Silva, University of Porto, Faculty of Sciences - Computer Science Department

Abstract:
Nodes in complex networks inherently represent different kinds of functional or organizational roles. In the dynamic process of an information cascade, users play different roles in spreading the information: some act as seeds to initiate the process, some limit the propagation, and others are in-between. Understanding the roles of users is crucial in modeling the cascades. Previous research mainly focuses on modeling users' behavior based upon the dynamic exchange of information with neighbors. We argue, however, that the structural patterns in the neighborhood of nodes may already contain enough information to infer users' roles, independently of the information flow itself. To explore this possibility, we examine how network characteristics of users affect their actions in the cascade. We also advocate that temporal information is very important. With this in mind, we propose an unsupervised methodology based on ensemble clustering to classify users into their social roles in a network, using not only their current topological positions but also their history over time. Our experiments on two social networks, Flickr and Digg, show that topological metrics indeed possess discriminatory power and that different structural patterns correspond to different parts of the process. We observe that user commitment in the neighborhood considerably affects the influence score of users. In addition, we discover that the cohesion of the neighborhood is important in the blocking behavior of users. With this we can construct topological fingerprints that can help us identify social roles, based solely on structural social ties, and independently of node activity and how information flows.

Scientific Track
Authors:
Asma Dachraoui, AgroParisTech & EDF
Alexis Bondu, EDF R&D
Antoine Cornuéjols, AgroParisTech

Abstract:
Classification of time series as early as possible is a much sought-after goal. Indeed, in many application domains, the earlier the decision, the more rewarding it can be. Yet, often, gathering more information allows one to reach a better decision. The optimization of this time-vs-accuracy tradeoff must generally be solved online and is a complex problem. This paper presents a formal criterion that expresses this trade-off in full generality, together with a generic sequential meta-algorithm to solve it. This meta-algorithm is interesting in two ways. First, it pinpoints where choices can (and have to) be made to obtain a computable algorithm; as a result, a wealth of algorithmic solutions can be found. Second, it seeks online the earliest time in the future where a minimization of the criterion can be expected. It thus goes beyond the classical approaches that myopically decide at each time step whether to make a decision or to postpone the call one more time step. After this general setting has been expounded, we study one simple instantiation of the meta-algorithm, and we show the results obtained on synthetic time series data sets chosen for their ability to test the robustness and properties of the technique. The general approach is vindicated by the experimental results, which allow us to point to promising perspectives.
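
The trade-off criterion can be pictured as an expected cost combining a misclassification term, estimated from the first t points, with a penalty growing in t; the decision time is the minimizer. A toy sketch with placeholder error estimates:

    def expected_cost(error_at_t, t, time_cost=0.01):
        return error_at_t + time_cost * t

    # Placeholder estimates of error when deciding after t points.
    errors = {10: 0.40, 20: 0.25, 30: 0.18, 40: 0.16, 50: 0.155}
    best_t = min(errors, key=lambda t: expected_cost(errors[t], t))
    print(best_t)  # the t balancing accuracy against earliness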

Industrial Track
Authors:
Hani Neuvirth, Microsoft
Yehuda Finkelstein, Microsoft
Amit Hilbuch, Microsoft
Shai Nahum, Microsoft
Daniel Alon, Microsoft
Elad Yom-Tov, Microsoft Research

Abstract:
Cloud computing resources are sometimes hijacked for fraudulent use. While some fraudulent use manifests as a small-scale resource consumption, a more serious type of fraud is that of fraud storms, which are events of large-scale fraudulent use. These events begin when fraudulent users discover new vulnerabilities in the sign up process, which they then exploit in mass. The ability to perform early detection of these storms is a critical component of any cloud-based public computing system. In this work we analyze telemetry data from Microsoft Azure to detect fraud storms and raise early alerts on sudden increases in fraudulent use. The use of machine learning approaches to identify such anomalous events involves two inherent challenges: the scarcity of these events and, at the same time, the high frequency of anomalous events in cloud systems. We compare the performance of a supervised approach to that achieved by an unsupervised, multivariate anomaly detection framework. We further evaluate the system performance taking into account practical considerations of robustness in the presence of missing values, and minimization of the model's data collection period. This paper describes the system, as well as the underlying machine learning algorithms applied. A beta version of the system is deployed and used to continuously control fraud levels in Azure.

Data Mining and Knowledge Discovery
Authors:

Nicola Barbieri, Francesco Bonchi, Edoardo Galimberti, Francesco Gullo

Affiliation(s):
Yahoo! Labs

Abstract:
Community search is the problem of finding a good community for a given set of query vertices. One of the most studied formulations of community search asks for a connected subgraph that contains all query vertices and maximizes the minimum degree. All existing approaches to min-degree-based community search suffer from limitations concerning efficiency, as they need to visit (a large part of) the whole input graph, as well as accuracy, as they output communities that are quite large and not really cohesive. Moreover, some existing methods lack generality: they handle only single-vertex queries, find communities that are not optimal in terms of minimum degree, and/or require input parameters. In this work we advance the state of the art on community search by proposing a novel method that overcomes all these limitations: it is in general more efficient and effective (by one to two orders of magnitude on average), it can handle multiple query vertices, it yields optimal communities, and it is parameter-free. These properties are confirmed by an extensive experimental analysis performed on various real-world graphs.

Scientific Track
Authors:
Hsun-Ping Hsieh, National Taiwan University
Cheng-Te Li, Academia Sinica
Shou-De Lin, National Taiwan University

Abstract:
Acquiring knowledge about the volume of customers at places and times of interest has several benefits, such as determining the locations of new retail stores and planning advertising strategies. This paper aims to estimate the number of potential customers of arbitrary query locations and any time of interest in modern urban areas. Our idea is to consider existing established stores as a kind of sensor, because the nearby human activities of the retail stores characterize the geographical properties, mobility patterns, and social behaviors of the target customers. To tackle the task based on store sensors, we develop a method called Potential Customer Estimator (PCE), which models the spatial and temporal correlation between existing stores and query locations using geographical, mobility, and location-based social network features. Experiments conducted on NYC Foursquare and Gowalla data, with three popular retail stores, Starbucks, McDonald's, and Dunkin' Donuts, exhibit superior results over state-of-the-art approaches.

Scientific Track
Authors:

Qingming Tang, Harry Yang, Jian Peng, Jinbo Xu

Affiliation(s):
Toyota Technological Institute
UIUC

Abstract:
This paper considers the problem of estimating multiple related Gaussian graphical models from a p-dimensional dataset consisting of different classes. Our work is based upon the formulation of this problem as group graphical lasso. This paper proposes a novel hybrid covariance thresholding algorithm that can effectively identify zero entries in the precision matrices and split a large joint graphical lasso problem into small subproblems. Our hybrid covariance thresholding method is superior to existing uniform thresholding methods in that our method can split the precision matrix of each individual class using different partition schemes and thus split group graphical lasso into much smaller subproblems, each of which can be solved very fast. In addition, this paper establishes necessary and sufficient conditions for our hybrid covariance thresholding algorithm. The superior performance of our thresholding method is thoroughly analyzed and illustrated by a few experiments on simulated data and real gene expression data.
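
A simple uniform covariance screening rule conveys the flavor of the splitting step: entries whose absolute empirical covariance never exceeds the penalty can be fixed to zero, and the connected components of the remaining pattern become independent subproblems. The paper's hybrid rule is class-specific; only the uniform version is sketched.

    import numpy as np
    from scipy.sparse.csgraph import connected_components

    def screen_blocks(S, lam):
        adj = (np.abs(S) > lam).astype(int)  # entries that may be non-zero
        np.fill_diagonal(adj, 0)
        n_comp, labels = connected_components(adj, directed=False)
        return n_comp, labels  # each component is an independent subproblem

    rng = np.random.default_rng(8)
    S = np.cov(rng.normal(size=(100, 6)), rowvar=False)
    print(screen_blocks(S, lam=0.3))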

Scientific Track
Authors:
Aleksey Buzmakov, LORIA (INRIA-CNRS-U. Lorraine)
Sergei Kuznetsov, Higher School of Economics
Amedeo Napoli, Inria Nancy Grand Est / LORIA

Abstract:
In pattern mining, the main challenge is the exponential explosion of the set of patterns. Typically, to solve this problem, a constraint for pattern selection is introduced. One of the first constraints proposed in pattern mining is the support (frequency) of a pattern in a dataset. Frequency is an anti-monotonic function, i.e., given an infrequent pattern, all its superpatterns are not frequent. However, many other constraints for pattern selection are not (anti-)monotonic, which makes it difficult to generate patterns satisfying them. In this paper we introduce the notion of projection antimonotonicity and the theta-Sofia algorithm, which allows efficient generation of the best patterns for some nonmonotonic constraints. We consider stability and delta-measure, which are nonmonotonic constraints, and apply them to interval tuple datasets. In the experiments, we compute the best interval tuple patterns w.r.t. these measures and show the advantage of our approach over post-filtering approaches.

Scientific Track
Authors:

Chao Zhang, Shan Jiang, Yucheng Chen, Yidan Sun, Jiawei Han

Affiliation(s):
UIUC

Abstract:
Random walk with restart (RWR) is widely recognized as one of the most important node proximity measures for graphs, as it captures the holistic graph structure and is robust to noise in the graph. In this paper, we study a novel query based on the RWR measure, called the inbound top-k (Ink) query. Given a query node q and a number k, the Ink query aims at retrieving k nodes in the graph that have the largest weighted RWR scores to q. Ink queries can be highly useful for various applications such as traffic scheduling, disease treatment, and targeted advertising. Nevertheless, none of the existing RWR computation techniques can accurately and efficiently process the Ink query in large graphs. We propose two algorithms, namely Squeeze and Ripple, both of which can accurately answer the Ink query in a fast and incremental manner. To identify the top-k nodes, Squeeze iteratively performs matrix-vector multiplication and estimates the lower and upper bounds for all the nodes in the graph. Ripple employs a more aggressive strategy by only estimating the RWR scores for the nodes falling in the vicinity of q, which makes it even more efficient than Squeeze. The nodes outside the vicinity do not need to be evaluated because their RWR scores are propagated from the boundary of the vicinity and thus upper bounded. Ripple incrementally expands the vicinity until the top-k result set can be obtained. Our extensive experiments on real-life graph data sets show that Ink queries can retrieve interesting results, and the proposed algorithms are orders of magnitude faster than state-of-the-art methods.
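
For orientation, plain RWR scores can be computed by power iteration, as below; Squeeze and Ripple add lower/upper bounds and locality on top of such iterations.

    import numpy as np

    def rwr(P, q, c=0.15, iters=100):
        """P: column-stochastic transition matrix; q: restart node index."""
        e = np.zeros(P.shape[0]); e[q] = 1.0
        r = e.copy()
        for _ in range(iters):
            r = (1 - c) * P @ r + c * e  # walk step plus restart
        return r  # r[i] = RWR score of node i with respect to q

    P = np.array([[0, .5, .5], [.5, 0, .5], [.5, .5, 0]]).T
    print(rwr(P, q=0))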

Scientific Track
Authors:
Paul Mineiro, Microsoft CISL
Nikos Karampatziakis, Microsoft CISL

Abstract:
Many modern multiclass and multilabel problems are characterized by increasingly large output spaces. For these problems, label embeddings have been shown to be a useful primitive that can improve computational and statistical efficiency. In this work we utilize a correspondence between rank constrained estimation and low dimensional label embeddings that uncovers a fast label embedding algorithm which works in both the multiclass and multilabel settings. The result is a randomized algorithm whose running time is exponentially faster than naive algorithms. We demonstrate our techniques on two large-scale public datasets, from the Large Scale Hierarchical Text Challenge and the Open Directory Project, where we obtain state of the art results.

Scientific Track
Authors:
Sebastian Pölsterl, Technische Universität München
Nassir Navab, Technische Universität München
Amin Katouzian, Technische Universität München

Abstract:
Survival analysis is a commonly used technique to identify important predictors of adverse events and develop guidelines for patient treatment in medical research. When applied to large amounts of patient data, efficient optimization routines become a necessity. We propose efficient training algorithms for three kinds of linear survival support vector machines: 1) ranking-based, 2) regression-based, and 3) combined ranking and regression. We perform optimization in the primal using truncated Newton optimization and use order statistic trees to lower the computational costs of training. We employ the same optimization technique and extend it to non-linear models too. Our results demonstrate the superiority of our proposed optimization scheme over existing training algorithms, which fail due to their inherently high time and space complexities when applied to large datasets. We validate the proposed survival models on 6 real-world datasets and show that pure ranking-based approaches outperform regression and hybrid models.

Scientific Track
Authors:
Matt Revelle, George Mason University
Carlotta Domeniconi, George Mason University
Mack Sweeney, George Mason University
Aditya Johri, George Mason University

Abstract:
Community detection in networks is a broad problem with many proposed solutions. Existing methods frequently make use of edge density and node attributes; however, the methods ultimately have different definitions of community and build strong assumptions about community features into their models. We propose a new method for community detection, which estimates both per-community feature distributions (topics) and per-node community membership. Communities are modeled as connected subgraphs with nodes sharing similar attributes. Nodes may join multiple communities and share common attributes with each. Communities have an associated probability distribution over attributes and node attributes are modeled as draws from a mixture distribution. We make two basic assumptions about community structure: communities are densely connected and have a small network diameter. These assumptions inform the estimation of community topics and membership assignments without being too prescriptive. We present competitive results against state-of-the-art methods for finding communities in networks constructed from NSF awards, the DBLP repository, and the Scratch online community.

Scientific Track
Authors:
Vinay Jethava, ETH Zurich
Niko Beerenwinkel, ETH Zurich

Abstract:
This paper considers the problem of finding large dense subgraphs in relational graphs, i.e., a set of graphs which share a common vertex set. We present an approximation algorithm for finding the densest common subgraph in a relational graph set, based on an extension of Charikar's linear programming based approach for finding the densest subgraph in a single graph. We also present a simple greedy heuristic which can be implemented efficiently for analysis of larger graphs. We give graph-dependent bounds on the quality of the solutions returned by our methods. Lastly, we show by empirical evaluation on several benchmark datasets that our method outperforms existing approaches.
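
A classic ingredient here is greedy peeling for the densest subgraph of a single graph: repeatedly remove a minimum-degree vertex and keep the densest subgraph seen. The paper extends this style of method to sets of graphs sharing a vertex set; only the single-graph heuristic is sketched.

    import networkx as nx

    def greedy_densest_subgraph(G):
        H = G.copy()
        best, best_density = set(H), H.number_of_edges() / max(len(H), 1)
        while len(H) > 1:
            v = min(H.degree, key=lambda kv: kv[1])[0]  # peel min-degree node
            H.remove_node(v)
            density = H.number_of_edges() / H.number_of_nodes()
            if density > best_density:
                best, best_density = set(H), density
        return best, best_density

    G = nx.barbell_graph(5, 2)  # two cliques joined by a short path
    nodes, density = greedy_densest_subgraph(G)
    print(sorted(nodes), round(density, 2))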

Data Mining and Knowledge Discovery
Rollover 2014
Authors:
Orestis Kostakis, F-Secure Labs
Panagiotis Papapetrou, Stockholm University, Department of Computer and Systems Sciences

Abstract:
"We study the problem of finding the Longest Common Sub-Pattern (LCSP) shared by two sequences of temporal intervals. In particular we are interested in finding the LCSP of the corresponding arrangements. Arrangements of temporal intervals are a powerful way to encode multiple concurrent labeled events that have a time duration. Discovering commonalities among such arrangements is useful for a wide range of scientific fields and applications, as it can be seen by the number and diversity of the datasets we use in our experiments. In this paper, we define the problem of LCSP and prove that it is NP-complete by demonstrating a connection between graphs and arrangements of temporal intervals, which leads to a series of interesting open problems. In addition, we provide an exact algorithm to solve the LCSP problem, and also propose and experiment with three polynomial time and space underapproximation techniques. Finally, we introduce two upper bounds for LCSP and study their suitability for speeding up 1-NN search. Experiments are performed on seven datasets taken from a wide range of real application domains, plus two synthetic datasets."