Accepted Papers

Industrial Track
Authors:
Thanh Lam Hoang, IBM
Eric Bouillet, IBM Research

Abstract:
Given a set of historical bus trajectories D and a partially observed bus trajectory S up to position l on the bus route, kernel regression (KR) is a non-parametric approach which predicts the arrival time of the bus at location l+h (h > 0) by averaging the arrival times observed at the same location in the past. The KR method does not weight the historical data equally; it gives more preference to more similar trajectories in the historical data. This method has been shown to outperform baseline methods such as linear regression or k-nearest-neighbour algorithms for bus arrival time prediction problems. However, the performance of KR is very sensitive to the method of evaluating similarity between trajectories. The general kernel regression algorithm looks back at the entire trajectory when evaluating similarity. In the case of bus arrival time prediction, this approach does not work well when outdated parts of the trajectories do not reflect the most recent behaviour of the buses. To solve this issue, we propose an approach that considers only the recent part of the trajectories, within a sliding window, for evaluating the similarity between them. The approach introduces a set of parameters corresponding to the window lengths at every position along the bus route, determining how far we should look back into the past when evaluating the similarity between trajectories. These parameters are automatically learned from training data. However, parameter learning is a time-consuming process given large training data (at least quadratic in the training size). Therefore, we propose an approximation algorithm with guarantees on error bounds to learn the parameters efficiently. The approximation algorithm is an order of magnitude faster than the exact algorithm. In an experiment with a real-world application deployed for Dublin city, our approach significantly reduced the prediction error compared to the state-of-the-art kernel regression algorithm.
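The sliding-window similarity at the heart of this approach can be conveyed with a minimal sketch using a Gaussian kernel. The function name, the fixed window length, and the bandwidth below are illustrative assumptions; the paper's method learns a window length per route position rather than fixing one.

```python
import numpy as np

def kr_predict(history, partial, pos, horizon, window, bandwidth=1.0):
    """Sliding-window kernel regression for bus arrival time prediction.

    history : array (n_trips, n_positions) of arrival times of past trips
              at each position along the route.
    partial : array of length pos+1 with the current trip's arrival times
              observed up to position `pos`.
    Returns the predicted arrival time at position pos + horizon.
    """
    lo = max(0, pos + 1 - window)          # look back only `window` positions
    recent = partial[lo:pos + 1]
    past = history[:, lo:pos + 1]
    # Gaussian kernel weight: trips with similar recent behaviour get more weight
    d2 = np.sum((past - recent) ** 2, axis=1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    target = history[:, pos + horizon]     # arrival times at the query position
    return float(np.dot(w, target) / np.sum(w))
```

With a small window, a historical trip that matches the current trip's recent arrival times dominates the weighted average even if its earlier behaviour differed.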

Scientific Track
Authors:
Ayan Acharya, University of Texas at Austin
Dean Teffer, Applied Research Laboratories
Jette Henderson, Applied Research Laboratories
Marcus Tyler, Applied Research Laboratories
Mingyuan Zhou, University of Texas at Austin
Joydeep Ghosh, University of Texas at Austin

Abstract:
Developing models to discover, analyze, and predict clusters within networked entities is an area of active and diverse research. However, many of the existing approaches do not take into consideration pertinent auxiliary information. This paper introduces Joint Gamma Process Poisson Factorization (J-GPPF) to jointly model networks and side information. J-GPPF naturally fits sparse networks, accommodates separately clustered side information in a principled way, and effectively addresses the computational challenges of analyzing large networks. Evaluated with hold-out link prediction performance on sparse networks (both synthetic and real-world) with side information, J-GPPF is shown to clearly outperform algorithms that only model the network adjacency matrix.

Demo
Authors:

Pierre Houdyer, Albrecht Zimmermann, Mehdi Kaytoue, Marc Plantevit, Céline Robardet, Joseph Mitchell

Affiliation(s):
TAPASTREET LTD.
University Lyon 1
INSA de Lyon

Abstract:
We present 'Gazouille', a system for discovering local events in geo-localized social media streams. The system is based on three core modules: (i) social network data acquisition over several urban areas, (ii) event detection through time series analysis, and (iii) a Web user interface to present events discovered in real time in a city, associated with a gallery of social media that characterize the event.

Machine Learning
Rollover 2014
Authors:
Theja Tulabandhula, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
Cynthia Rudin, MIT Sloan School of Management

Abstract:
In this paper, we consider a supervised learning setting where side knowledge is provided about the labels of unlabeled examples. The side knowledge has the effect of reducing the hypothesis space, leading to tighter generalization bounds, and thus possibly better generalization. We consider several types of side knowledge, the first leading to linear and polygonal constraints on the hypothesis space, the second leading to quadratic constraints, and the last leading to conic constraints. We show how different types of domain knowledge can lead directly to these kinds of side knowledge. We prove bounds on complexity measures of the hypothesis space for quadratic and conic side knowledge, and show that these bounds are tight in a specific sense for the quadratic case.

Scientific Track
Authors:

Karim Abou-Moustafa, Dale Schuurmans

Affiliation(s):
University of Alberta

Abstract:
We are interested in the following questions. Given a finite data set S, with neither labels nor side information, and an unsupervised learning algorithm A, can the generalization of A be assessed on S? Similarly, given two unsupervised learning algorithms, A1 and A2, for the same learning task, can one assess whether one will generalize "better" on future data drawn from the same source as S? In this paper, we develop a general approach to answering these questions in a reliable and efficient manner using mild assumptions on A. We first propose a concrete generalization criterion for unsupervised learning that is analogous to prediction error in supervised learning. Then, we develop a computationally efficient procedure that realizes the generalization criterion on finite data sets, and propose an extension for comparing the generalization of two algorithms on the same data set. We validate the overall framework on algorithms for clustering and dimensionality reduction (linear and nonlinear).

Data Mining and Knowledge Discovery
Authors:

Reihaneh Rabbany, Osmar R. Zaiane

Affiliation(s):
University of Alberta

Abstract:
A measure of distance between two clusterings has important applications, including clustering validation and ensemble clustering. Generally, such a distance measure enables navigation through the space of possible clusterings. Mostly used in cluster validation, a normalized clustering distance, a.k.a. an agreement measure, compares a given clustering result against the ground-truth clustering. The two most widely used clustering agreement measures are the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI). In this paper, we present a generalized clustering distance from which these two measures can be derived. We then use this generalization to construct new measures specific for comparing (dis)agreement of clusterings in networks, a.k.a. communities. Further, we discuss the difficulty of extending the current, contingency-based, formulations to overlapping cases, and present an alternative algebraic formulation for these (dis)agreement measures. Unlike the original measures, the new co-membership based formulation is easily extendable to different cases, including overlapping clusters and clusters of inter-related data. These two extensions are, in particular, important in the context of finding communities in complex networks.
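The co-membership idea can be illustrated with a minimal sketch: represent each clustering as a binary co-membership matrix and measure pairwise agreement over distinct pairs, which recovers the (unadjusted) Rand index. The function names are illustrative; the paper's generalized measures are more elaborate.

```python
import numpy as np

def comembership(labels):
    """Binary co-membership matrix: C[i, j] = 1 iff i and j share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def rand_index(a, b):
    """Rand index as co-membership agreement over distinct pairs of items."""
    ca, cb = comembership(a), comembership(b)
    n = len(a)
    iu = np.triu_indices(n, k=1)           # count each unordered pair once
    agree = np.sum(ca[iu] == cb[iu])
    return agree / len(iu[0])
```

Because the measure is defined on co-membership matrices rather than a contingency table, the same formulation extends naturally when an item may belong to several clusters (the matrix entries then become degrees of shared membership).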

Machine Learning
Authors:
Brijnesh J. Jain, Technische Universität Berlin

Abstract:
The majority of machine learning algorithms assume that objects are represented as vectors. But often the objects we want to learn from are more naturally represented by other data structures such as sequences and time series. For these representations many standard learning algorithms are unavailable. We generalize gradient-based learning algorithms to time series under dynamic time warping. To this end, we introduce elastic functions, which extend functions on Euclidean spaces to time series spaces. Necessary conditions are sketched under which generalized gradient learning on time series is consistent. Specifically, four linear classifiers are extended to time series under dynamic time warping and applied to benchmark datasets. Results indicate that generalized gradient learning via elastic functions has the potential to complement the state of the art in pattern recognition on time series.
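For readers unfamiliar with the elastic setting, dynamic time warping itself is computed with a classic dynamic program. This is a textbook sketch with a squared-error local cost, not the paper's learning algorithm:

```python
import numpy as np

def dtw(x, y):
    """Dynamic time warping distance between two 1-D series.

    D[i, j] holds the minimal accumulated cost of aligning x[:i] with y[:j];
    each cell extends the best of the three allowed predecessor alignments.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Note that DTW tolerates local stretching: a series aligned with a time-stretched copy of itself has distance zero, which is exactly why Euclidean-space learning algorithms do not transfer directly.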

Scientific Track
Authors:
Pauli Miettinen, Max-Planck-Institut für Informatik

Abstract:
Matrix factorizations are a popular tool to mine regularities from data. There are many ways to interpret the factorizations, but one particularly suited for data mining utilizes the fact that a matrix product can be interpreted as a sum of rank-1 matrices. Then the factorization of a matrix becomes the task of finding a small number of rank-1 matrices whose sum is a good representation of the original matrix. Seen this way, it becomes obvious that many problems in data mining can be expressed as matrix factorizations with correct definitions of what a rank-1 matrix and a sum of rank-1 matrices mean. This paper develops a unified theory, based on generalized outer product operators, that encompasses many pattern set mining tasks. The focus is on the computational aspects of the theory, studying the computational complexity and approximability of many problems related to generalized matrix factorizations. The results immediately apply to a large number of data mining problems, and will hopefully allow generalizing future results and algorithms as well.

Scientific Track
Authors:
Mohadeseh Ganji, University of Melbourne
Abbas Seifi, Amirkabir University of Technology
Hosein Alizadeh, Iran University of Science and Technology
James Bailey, University of Melbourne
Peter Stuckey, University of Melbourne

Abstract:
Detecting the underlying community structure of networks is an important problem in complex network analysis. Modularity is a well-known quality function, introduced by Newman, that measures how vertices in a community share more edges than would be expected in a randomized network. However, this limited view on vertex similarity leads to limits in what can be resolved by modularity. To overcome these limitations, we propose a generalized modularity measure called GM which has a more sophisticated interpretation of vertex similarity. In particular, GM also takes into account the number of longer paths between vertices, compared to what would be expected in a randomized network. We also introduce a unified version of GM which detects communities of unipartite and (near-)bipartite networks without knowing the structure type in advance. Experiments on different synthetic and real data sets demonstrate that GM performs strongly in comparison to several existing approaches, particularly for small-world networks.
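Newman's modularity, which GM generalizes, can be computed directly from the adjacency matrix with the standard textbook formula Q = (1/2m) Σ_ij [A_ij − k_i k_j / 2m] δ(c_i, c_j). The function name is illustrative:

```python
import numpy as np

def modularity(A, communities):
    """Newman modularity of a community assignment on an undirected graph.

    A           : symmetric adjacency matrix (n x n).
    communities : community label per vertex (length n).
    """
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                      # vertex degrees
    two_m = A.sum()                        # 2m: total degree of the graph
    c = np.asarray(communities)
    delta = c[:, None] == c[None, :]       # 1 iff vertices share a community
    return float(((A - np.outer(k, k) / two_m) * delta).sum() / two_m)
```

The k_i k_j / 2m term is the expected number of edges between i and j in the degree-preserving random null model; GM's contribution is to extend this comparison beyond direct edges to longer paths.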

Machine Learning
Rollover 2014
Authors:
Mohamed Elhoseiny, Rutgers University
Ahmed Elgammal, Rutgers University

Abstract:
There has been a growing interest in mutual information measures due to their wide range of applications in Machine Learning and Computer Vision. In this manuscript, we present a generalized structured regression framework based on Sharma-Mittal divergence, a relative entropy measure that is addressed in the Machine Learning community for the first time in this work. Sharma-Mittal (SM) divergence is a generalized mutual information measure for the widely used Rényi, Tsallis, Bhattacharyya, and Kullback-Leibler (KL) relative entropies. Specifically, we study Sharma-Mittal divergence as a cost function in the context of Twin Gaussian Processes (TGP), which generalizes over the KL-divergence without computational penalty. We show interesting properties of Sharma-Mittal TGP (SMTGP) through a theoretical analysis, which covers missing insights in the traditional TGP formulation, and we generalize this theory based on SM-divergence, of which KL-divergence is a special case. Experimentally, we evaluated the proposed SMTGP framework on several datasets. The results show that SMTGP achieves better predictions than KL-based TGP (KLTGP), since it offers a bigger class of models through the parameters that we learn from the data.

Machine Learning
Authors:

Bo Chen, Kai Ming Ting, Takashi Washio, Gholamreza Haffari

Affiliation(s):
Monash University, Australia

Abstract:
Data depth is a statistical method which models data distribution in terms of center-outward ranking rather than density or linear ranking. Despite considerable academic interest, its applications are hampered by the lack of a method which is both robust and efficient. This paper introduces Half-Space Mass, a significantly improved version of half-space data depth. Half-Space Mass is, as far as we know, the only data depth method which is both robust and efficient. We also reveal four theoretical properties of Half-Space Mass: (i) its resultant mass distribution is concave regardless of the underlying density distribution, (ii) its maximum point is unique and can be considered a median, (iii) the median is maximally robust, and (iv) its estimation extends to a higher-dimensional space in which the convex hull of the dataset occupies zero volume. We demonstrate the power of Half-Space Mass through its applications in two tasks. In anomaly detection, being a maximally robust location estimator leads directly to a robust anomaly detector that yields better detection accuracy than half-space depth, and it runs orders of magnitude faster than L2 depth, an existing maximally robust location estimator. In clustering, the Half-Space Mass version of K-means overcomes three weaknesses of K-means.
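The half-space idea can be conveyed with a crude Monte-Carlo sketch: average, over random directions and random split points, the fraction of data lying on the same side of the split as the query point. Central points sit on the heavy side of most splits, extreme points do not. This is a simplified reading of the idea, not the authors' estimator:

```python
import numpy as np

def half_space_mass(data, query, n_projections=200, seed=0):
    """Monte-Carlo sketch of a half-space mass score.

    For each random direction, project the data, pick a random split point,
    and record the fraction of data on the query's side of the split.
    """
    rng = np.random.default_rng(seed)
    d = data.shape[1]
    scores = []
    for _ in range(n_projections):
        w = rng.normal(size=d)                     # random direction
        proj = data @ w
        s = rng.uniform(proj.min(), proj.max())    # random split point
        q = query @ w
        same_side = (proj >= s) if q >= s else (proj < s)
        scores.append(same_side.mean())
    return float(np.mean(scores))
```

Under this sketch a central point scores above 0.5 on every split while an outlier averages about 0.5, which is consistent with the concavity and robust-median properties claimed in the abstract.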

Scientific Track
Authors:
Benjamin Fish, UIC
Rajmonda Caceres, MIT Lincoln Laboratory

Abstract:
Oversampling is a common characteristic of data representing dynamic networks. It introduces noise into representations of dynamic networks, but there has been little work so far to compensate for it. Oversampling can affect the quality of many important algorithmic problems on dynamic networks, including link prediction. Link prediction asks us to predict the edges that will be added to the network given previous snapshots of the network. We show that not only does oversampling affect the quality of link prediction, but also that we can use link prediction to recover from the effects of oversampling. We also introduce a novel generative model of noise in dynamic networks that represents oversampling. We demonstrate the results of our approach on both synthetic and real-world data.

Scientific Track
Authors:
Yinjie Huang, University of Central Florida
Michael Georgiopoulos, University of Central Florida
Georgios Anagnostopoulos, Florida Institute of Technology

Abstract:
In this paper we introduce a novel hash learning framework that has two main distinguishing features, when compared to past approaches. First, it utilizes codewords in the Hamming space as ancillary means to accomplish its hash learning task. These codewords, which are inferred from the data, attempt to capture grouping aspects of the data's hash codes. Secondly and more importantly, the same framework is capable of addressing supervised, unsupervised and, even, semi-supervised hash learning tasks. A series of comparative experiments focused on content-based image retrieval highlights its performance advantages.

Demo
Authors:

Bijan Ranjbar-Sahraei, Julia Efremova, Hossein Rahmani, Toon Calders, Karl Tuyls, Gerhard Weiss

Affiliation(s):
Maastricht University
University of Liverpool
Université Libre de Bruxelles

Abstract:
Entity Resolution (ER) is the task of finding references that refer to the same entity across different data sources. Cleaning a data warehouse and applying ER to it is a computationally demanding task, particularly for large data sets that change dynamically. Therefore, a query-driven approach which analyses a small subset of the entire data set and integrates the results in real time is significantly beneficial. Here, we present an interactive tool, called HiDER, which allows for query-driven ER in large collections of uncertain, dynamic historical data. The input data includes civil registers such as birth, marriage and death certificates in the form of structured data, and notarial acts such as estate tax and property transfers in the form of free text. The outputs are family networks and event timelines visualized in an integrated way. HiDER is being used and tested at the BHIC center (Brabant Historical Information Center); despite the uncertainties of the BHIC input data, the extracted entities have high certainty and are enriched with extra information.

Scientific Track
Authors:
Xiao Bian, North Carolina State University
Xia Ning, Indiana University-Purdue University Indianapolis
Guofei Jiang, NEC Laboratories America

Abstract:
Sparse coding plays a key role in high-dimensional data analysis. One critical challenge of sparse coding is to design a dictionary that is both adaptive to the training data and generalizable to data of the same type. In this paper, we propose a novel dictionary learning algorithm to build an adaptive dictionary regularized by an a priori over-complete dictionary. This leads to a hierarchical sparse structure on the a priori dictionary over the learned dictionary, and a hierarchical sparse structure on the learned dictionary over the data, respectively. We demonstrate that this approach reduces overfitting and enhances the generalizability of the learned dictionary. Moreover, the learned dictionary is optimized to adapt to the given data and results in a more compact dictionary and a more robust sparse representation. We apply the hierarchical sparse dictionary learning approach to both synthetic data and real-world high-dimensional time series data, where the time series exhibit high heterogeneity in their nature, e.g., continuous vs. discrete, smooth vs. non-smooth, etc. The experimental results on a real dataset from a chemical plant process monitoring system demonstrate that the proposed algorithm can successfully characterize the heterogeneity of the given data and leads to a better and more robust dictionary.

Scientific Track
Authors:

Anveshi Charuvaka, Huzefa Rangwala

Affiliation(s):
George Mason University

Abstract:
Hierarchical Classification (HC) is an important problem with a wide range of applications in domains such as music genre classification, protein function classification and document classification. Although several innovative classification methods have been proposed to address HC, most of them are not scalable to web-scale problems. While simple methods such as top-down "pachinko"-style classification and flat classification scale well, they either have poor classification performance or do not effectively use the hierarchical information. Current methods that incorporate hierarchical information in a principled manner are often computationally expensive and unable to scale to large datasets. In the current work, we adopt a cost-sensitive classification approach to the hierarchical classification problem by defining the misclassification cost based on the hierarchy. This approach effectively decouples the models for the various classes, allowing us to efficiently train effective models for large hierarchies in a distributed fashion.
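One natural hierarchy-induced misclassification cost is the tree distance between the true and the predicted label (mistakes between nearby classes cost less than mistakes across distant branches). The following is an illustrative sketch only; the paper's exact cost definition may differ:

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root (inclusive)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(a, b, parent):
    """Number of edges between labels a and b in the class hierarchy --
    one plausible hierarchy-induced misclassification cost."""
    pa, pb = path_to_root(a, parent), path_to_root(b, parent)
    depth_a = {n: i for i, n in enumerate(pa)}
    for j, n in enumerate(pb):
        if n in depth_a:                   # lowest common ancestor
            return depth_a[n] + j
    raise ValueError("labels are not in the same tree")
```

Because such a cost depends only on the label pair, each class model can be trained independently with per-example weights, which is what makes the distributed training mentioned in the abstract possible.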

Scientific Track
Authors:
Koh Takeuchi, NTT
Yoshinobu Kawahara, Osaka University
Tomoharu Iwata, NTT

Abstract:
We often encounter situations in supervised learning where there exist possibly overlapping groups that consist of more than two parameters. For example, we might work on parameters that correspond to words expressing the same meaning, music pieces in the same genre, and books released in the same year. Based on such auxiliary information, we could suppose that parameters in a group have similar roles in a problem and similar values. In this paper, we propose the Higher Order Fused (HOF) regularization that can incorporate smoothness among parameters with group structures as prior knowledge in supervised learning. We define the HOF penalty as the Lovász extension of a submodular higher-order potential function, which encourages parameters in a group to take similar estimated values when used as a regularizer. Moreover, we develop an efficient network flow algorithm for calculating the proximity operator for the regularized problem. We investigate the empirical performance of the proposed algorithm by using synthetic and real-world data.

Scientific Track
Authors:
Nicolas Schilling, University of Hildesheim
Martin Wistuba, University of Hildesheim
Lucas Drumond, University of Hildesheim
Schmidt-Thieme Lars, University of Hildesheim

Abstract:
In machine learning, hyperparameter optimization is a challenging task that is usually approached by experienced practitioners or in a computationally expensive brute-force manner such as grid search. Therefore, recent research proposes to use observed hyperparameter performance on already solved problems (i.e. data sets) in order to speed up the search for promising hyperparameter configurations in the sequential model-based optimization framework. In this paper, we propose multilayer perceptrons as surrogate models as they are able to model highly nonlinear hyperparameter response surfaces. However, since interactions of hyperparameters, data sets and metafeatures are only implicitly learned in the subsequent layers, we improve the performance of multilayer perceptrons by means of an explicit factorization of the interaction weights and call the resulting model a factorized multilayer perceptron. Additionally, we evaluate different ways of obtaining predictive uncertainty, which is a key ingredient for a decent tradeoff between exploration and exploitation. Our experimental results on two public meta data sets demonstrate the efficiency of our approach compared to a variety of published baselines. For reproduction purposes, we make our data sets and all the program code publicly available on our supplementary webpage.
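The generic sequential model-based optimization loop can be sketched in a few lines: fit a surrogate on the observed (configuration, performance) pairs, then evaluate the candidate that maximizes an acquisition score trading off predicted performance against uncertainty. Here a deliberately simple polynomial surrogate and a distance-based uncertainty proxy stand in for the paper's factorized multilayer perceptron; all names are illustrative:

```python
import numpy as np

def smbo(objective, candidates, n_init=3, n_iter=10, seed=0):
    """Toy SMBO loop over scalar hyperparameter configurations.

    objective  : black-box function to maximize (e.g. validation accuracy).
    candidates : finite list of configurations to choose from.
    """
    rng = np.random.default_rng(seed)
    X = list(rng.choice(candidates, size=n_init, replace=False))
    y = [objective(x) for x in X]
    for _ in range(n_iter):
        # surrogate: quadratic fit of observed performance vs. configuration
        coef = np.polyfit(X, y, deg=min(2, len(X) - 1))
        pool = [c for c in candidates if c not in X]
        if not pool:
            break
        mu = np.polyval(coef, pool)
        # uncertainty proxy: distance to the closest evaluated configuration
        sigma = np.array([min(abs(c - xi) for xi in X) for c in pool])
        nxt = pool[int(np.argmax(mu + sigma))]     # optimistic acquisition
        X.append(nxt)
        y.append(objective(nxt))
    best = int(np.argmax(y))
    return X[best], y[best]
```

The quality of the surrogate's mean and uncertainty estimates is exactly what the factorized multilayer perceptron is designed to improve.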

Scientific Track
Authors:
Martin Wistuba, University of Hildesheim
Nicolas Schilling, University of Hildesheim
Schmidt-Thieme Lars, University of Hildesheim

Abstract:
The optimization of hyperparameters is often done manually or exhaustively, but recent work has shown that automatic methods can optimize hyperparameters faster and even achieve better final performance. Sequential model-based optimization (SMBO) is the current state-of-the-art framework for automatic hyperparameter optimization. Currently, it consists of three components: a surrogate model, an acquisition function and an initialization technique. We propose to add a fourth component, a way of pruning the hyperparameter search space, which is a common way of accelerating the search in many domains but has not yet been applied to hyperparameter optimization. We propose to discard regions of the search space that are unlikely to contain better hyperparameter configurations by transferring knowledge from past experiments on other data sets as well as taking into account the evaluations already done on the current data set. Pruning as a new component for SMBO is an orthogonal contribution, but nevertheless we compare it to surrogate models that learn across data sets and extensively investigate the impact of pruning, with and without initialization, for various state-of-the-art surrogate models. The experiments are conducted on two newly created meta-data sets which we make publicly available. One of these meta-data sets was created on 59 data sets using 19 different classifiers, resulting in a total of about 1.3 million experiments. This is more than four times larger than all the results collaboratively collected by OpenML.

Demo
Authors:

André Lamúrias, Luka Clarke, Francisco Couto

Affiliation(s):
LASIGE
Faculty of Sciences, University of Lisbon

Abstract:
Automatic methods are being developed and applied to transform textual biomedical information into machine-readable formats. Machine learning techniques have been a prominent approach to this problem. However, there is still a lack of systems that are easily accessible to users. For this reason, we developed a web tool to facilitate access to our text mining framework, IICE (Identifying Interactions between Chemical Entities). This tool annotates the input text with chemical entities and identifies the interactions described between these entities. Various options are available, which can be manipulated to control the algorithms employed by the framework and the output formats.

Machine Learning
Authors:

Amit Dhurandhar, Karthik Sankarnarayanan

Affiliation(s):
IBM Thomas J Watson Research Center

Abstract:
In multiple domains, actively acquiring missing input information at a reasonable cost in order to improve our understanding of the input-output relationships is of increasing importance. This problem has gained prominence in healthcare, public policy making, education, and in the targeted advertising industry which tries to best match people to products. In this paper we tackle an important variant of this problem: Instance Completion, where we want to choose the best k incomplete instances to query from a much larger universe of N (>> k) incomplete instances so as to learn the most accurate classifier. We propose a principled framework which motivates a generally applicable yet efficient meta-technique for choosing k such instances. Since we cannot know a priori the classifier that will result from the completed dataset, i.e. the final classifier, our method chooses the k instances based on a derived upper bound on the expectation of the distance between the next classifier and the final classifier. We additionally derive a sufficient condition for these two solutions to match. We then empirically evaluate the performance of our method relative to the state-of-the-art methods on 4 UCI datasets as well as 3 proprietary e-commerce datasets used in previous studies. In these experiments, we also demonstrate how close we are likely to be to the optimal solution, by quantifying the extent to which our sufficient condition is satisfied. Lastly, we show that our method is easily extensible to the setting where we have a non-uniform cost associated with acquiring the missing information.

Machine Learning
Authors:

Nikos Katzouris, Alexander Artikis, Georgios Paliouras

Affiliation(s):
Institute of Informatics and Telecommunications, NCSR "Demokritos"

Abstract:
Event recognition systems rely on knowledge bases of event definitions to infer occurrences of events in time. Using a logical framework for representing and reasoning about events offers direct connections to machine learning, via Inductive Logic Programming (ILP), thus making it possible to avoid the tedious and error-prone task of manual knowledge construction. However, learning the temporal logical formalisms typically utilized by logic-based event recognition systems is a challenging task, which most ILP systems cannot fully undertake. In addition, event-based data is usually massive and collected at different times and under various circumstances. Ideally, systems that learn from temporal data should be able to operate in an incremental mode, that is, revise prior constructed knowledge in the face of new evidence. In this work we present an incremental method for learning and revising event-based knowledge, in the form of Event Calculus programs. The proposed algorithm relies on abductive-inductive learning and comprises a scalable clause refinement methodology, based on a compressive summarization of clause coverage in a stream of examples. We present an empirical evaluation of our approach on real and synthetic data from activity recognition and city transport applications.

Scientific Track
Authors:
Yuxiao Dong, University of Notre Dame
Fabio Pinelli, IBM Research Center
Yiannis Gkoufas, IBM Research - Ireland
Zubair Nabi, IBM Research - Ireland
Francesco Calabrese, IBM Research - Ireland
Nitesh Chawla, University of Notre Dame

Abstract:
The pervasiveness and availability of mobile phone data offer the opportunity of discovering usable knowledge about crowd behavior in urban environments. Cities can leverage such knowledge to provide better services (e.g., public transport planning, optimized resource allocation) and a safer environment. Call Detail Record (CDR) data represents a practical data source for detecting and monitoring unusual events, considering the high level of mobile phone penetration compared with GPS-equipped and open devices. In this paper, we propose a methodology that is able to detect unusual events from CDR data, which typically has low accuracy in terms of space and time resolution. Moreover, we introduce a concept of unusual event that involves a large number of people who exhibit unusual mobility behavior. Our careful consideration of the issues that come with coarse-grained CDR data ultimately leads to a completely general framework that can detect unusual crowd events from CDR data effectively and efficiently. Through extensive experiments on real-world CDR data for a large city in Africa, we demonstrate that our method can detect unusual events with 16% higher recall and over 10 times higher precision, compared to state-of-the-art methods. We implement a visual analytics prototype system to help end users analyze detected unusual crowd events to best suit different application scenarios. To the best of our knowledge, this is the first work on the detection of unusual events from CDR data that considers its temporal and spatial sparseness and distinguishes between unusual user activities and daily routines.

Demo
Authors:

Matt McVicar, Cédric Mesnage, Jefrey Lijffijt, Tijl De Bie

Affiliation(s):
University of Bristol

Abstract:
We present an exploratory data mining tool useful for finding patterns in the geographic distribution of independent UK-based music artists. Our system is interactive, highly intuitive, and entirely browser-based, meaning it can be used without any additional software installations from any device. The target audiences are artists, other music professionals, and the general public. Potential uses of our software include highlighting discrepancies in supply and demand of specific music genres in different parts of the country, and identifying at a glance which areas have the highest densities of independent music artists.

Scientific Track
Authors:

Shaona Ghosh, Adam Prugel-Bennett

Affiliation(s):
University of Southampton

Abstract:
We develop an online learning algorithm for bandits on a graph with side information where there is an underlying Ising distribution over the vertices at low temperatures. We are motivated by practical settings where the graph state in a social or a computer host network (potentially) changes at every trial, intrinsically partitioning the graph and thus requiring the learning algorithm to play the bandit from the current partition. Our algorithm essentially functions as a two-stage process. In the first stage it uses "minimum cut" as the regularity measure to compute the state of the network by using the side label received, acting as a graph classifier. The classifier internally uses a polynomial-time linear programming relaxation technique that incorporates the known information to predict the unknown states. The second stage ensures that the bandits are sampled from the appropriate partition of the graph, with the potential for exploring the other part. We achieve this by running the adversarial multi-armed bandit for the edges in the current partition while exploring the "cut" edges. We empirically evaluate the strength of our approach through synthetic and real-world datasets. We also indicate the potential for a linear-time exact algorithm for calculating the max-flow as an alternative to the linear programming relaxation, besides promising bounded mistakes/regret in the number of times the "cut" changes.

Scientific Track
Authors:

Maria-Irina Nicolae, Eric Gaussier, Amaury Habrard, Marc Sebban

Affiliation(s):
Jean Monnet University
Universite Joseph Fourier

Abstract:
The importance of metrics in machine learning has attracted a growing interest in distance and similarity learning. We study this problem in the situation where little labeled data (and potentially little unlabeled data as well) is available, a situation that arises in several practical contexts. We also provide a complete theoretical analysis of the proposed approach. It is indeed worth noting that the metric learning research field lacks theoretical guarantees on the generalization capacity of the classifier associated with a learned metric. The theoretical framework of (ε,γ,τ)-good similarity functions has been one of the first attempts to draw a link between the properties of a similarity function and those of a linear classifier making use of it. In this paper, we extend this theory to a method where the metric and the separator are jointly learned in a semi-supervised way, a setting that has not been explored before, and provide a theoretical analysis of this joint learning via Rademacher complexity. Experiments performed on standard datasets show the benefits of our approach over state-of-the-art methods.

Data Mining and Knowledge Discovery
Authors:

Yu Zhao, Sheng Gao, Patrick Gallinari, Jun Guo

Affiliation(s):
National Natural Science Foundation of China

Abstract:
A knowledge base consisting of triples of the form (subject entity, predicate relation, object entity) is a very important resource for knowledge management. It is very useful for human-like reasoning, query expansion, question answering (e.g., Siri) and other related AI tasks. However, knowledge bases often suffer from incompleteness, due to the large volume of ever-increasing knowledge in the real world and a lack of reasoning capability. In this paper, we propose a Pairwise-interaction Differentiated Embeddings (PIDE) model that embeds the entities and relations of a knowledge base into low-dimensional vector representations and then predicts the likely truth of additional facts, extending the knowledge base. In addition, we present a probability-based objective function to improve the model optimization. Finally, we evaluate the model by computing how likely an additional triple is to be true, for the task of knowledge base completion. Experiments on the WordNet and Freebase datasets show the excellent performance of our model and algorithm.
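
For intuition, embedding models of this family score a candidate triple by how well the relation vector maps the subject entity near the object entity. The sketch below is a generic translation-style (TransE-like) scorer, not the PIDE model itself, whose pairwise-interaction terms are specific to the paper; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def score(h, r, t):
    """Translation-style plausibility score: smaller ||h + r - t||
    means the triple (h, r, t) is more likely to be true."""
    return -np.linalg.norm(h + r - t)

# Toy embeddings: entity and relation vectors in a low-dimensional space.
dim = 8
paris, france, capital_of = rng.normal(size=(3, dim))
# A well-trained model would make h + r land near t for true triples;
# here we construct a perfect fit purely for illustration.
france_fit = paris + capital_of
```

A link-prediction step would then rank all candidate object entities by this score for a given (subject, relation) pair.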

Scientific Track
Authors:
Ziqiang Shi, Fujitsu Research & Development
Rujie Liu, Fujitsu Research & Development

Abstract:
In this work, we generalize and unify two recent, quite different works, by Jascha and by Lee respectively, by proposing the proximal stochastic Newton-type gradient (PROXTONE) method for optimizing the sum of two convex functions: the average of a huge number of smooth convex functions, plus a nonsmooth convex function. PROXTONE incorporates second-order information to obtain stronger convergence results, achieving a linear convergence rate not only in the value of the objective function but also in the solution. The proofs are simple and intuitive, and the results and techniques can serve as a starting point for research on proximal stochastic methods that employ second-order information. The methods and principles proposed in this paper can be used for logistic regression, the training of deep neural networks, and so on. Our numerical experiments show that PROXTONE achieves better computational performance than existing methods.
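
The composite objective (smooth average plus nonsmooth term) can be illustrated with a plain first-order proximal gradient step; note that this sketch omits the second-order (Newton-type) information that distinguishes PROXTONE, and takes the nonsmooth term to be an L1 penalty for concreteness:

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1 (elementwise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_grad_lasso(A, b, lam, step, iters=500):
    """Proximal gradient for 0.5*||Ax - b||^2 / n + lam*||x||_1.
    PROXTONE would replace the plain gradient step with a Newton-type
    step; this first-order version only illustrates the
    smooth + nonsmooth split the abstract describes."""
    n = len(b)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b) / n   # gradient of the smooth average
        x = soft_threshold(x - step * grad, step * lam)  # prox step
    return x
```

With A the identity, the minimizer is simply the soft-thresholded data vector, which makes the method easy to sanity-check.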

Scientific Track
Authors:

Duc Luu, Peng Lim

Affiliation(s):
Singapore Management University

Abstract:
Diffusion is an important dynamic that helps spread information within an online social network. While there are already numerous models for single-item diffusion, few have studied the diffusion of multiple items, especially when items can interact with one another due to their inter-similarity. Moreover, the well-known homophily effect is rarely considered explicitly in existing diffusion models. This work therefore fills this gap by proposing a novel model called Topic-level Interaction Homophily Aware Diffusion (TIHAD) that includes both latent-factor-level interaction among items and the homophily factor in diffusion. The model determines item interaction based on latent factors, and edge strengths based on the homophily factor, in the computation of social influence. An algorithm for training the TIHAD model is also proposed. Our experiments on synthetic and real datasets show that: (a) homophily increases diffusion significantly, and (b) item interaction at the topic level boosts diffusion among similar items. A case study on hashtag diffusion in Twitter also shows that TIHAD outperforms the baseline model on the hashtag adoption prediction task.

Scientific Track
Authors:
Pengtao Xie

Abstract:
Learning a proper distance metric is of vital importance for many distance-based applications. Distance metric learning aims to learn a set of latent factors based on which the distances between data points can be effectively measured. The number of latent factors incurs a tradeoff: a small number of factors is not powerful and expressive enough to measure distances well, while a large number of factors causes high computational overhead. In this paper, we aim to achieve two seemingly conflicting goals: keeping the number of latent factors small for the sake of computational efficiency, while making them as effective as a large set of factors. The approach we take is to impose a diversity regularizer over the latent factors to encourage them to be uncorrelated, such that each factor captures some unique information that is hard to capture with the other factors. In this way, a small number of latent factors can suffice to capture a large proportion of the information, which retains computational efficiency while preserving effectiveness in measuring distances. Experiments on retrieval, clustering and classification demonstrate that a small number of factors learned with diversity regularization can achieve comparable or even better performance than a large factor set learned without regularization.
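
One common way to instantiate such a diversity regularizer, an assumption on our part rather than necessarily the paper's exact formulation, is to penalize the off-diagonal entries of the latent factors' Gram matrix, so that correlated factors are discouraged:

```python
import numpy as np

def diversity_penalty(L):
    """Encourage latent factors (rows of L) to be uncorrelated:
    penalize the squared off-diagonal entries of the Gram matrix L L^T."""
    G = L @ L.T
    return float(np.sum((G - np.diag(np.diag(G))) ** 2))

def mahalanobis_sq(x, y, L):
    """Squared distance measured through k latent factors: ||L(x - y)||^2."""
    d = L @ (x - y)
    return float(d @ d)
```

Orthogonal factor rows incur zero penalty; duplicated (fully correlated) rows are maximally penalized, which is exactly the redundancy the abstract aims to remove.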

Industrial Track
Authors:
Vojtech Franc, Czech Technical University in Prague
Michal Sofka, Cisco Systems
Karel Bartos, Cisco Systems

Abstract:
We address the problem of learning a detector of malicious behavior in network traffic. The malicious behavior is detected based on the analysis of network proxy logs that capture malware communication between client and server computers. The conceptual problem in using standard supervised learning methods is the lack of a sufficiently representative training set containing examples of malicious and legitimate communication. Annotating individual proxy logs is an expensive process involving security experts and does not scale with constantly evolving malware. However, weak supervision can be achieved at the level of properly defined bags of proxy logs by leveraging internet domain blacklists, security reports, and sandboxing analysis. We demonstrate that an accurate detector can be obtained from the collected security intelligence data by using a Multiple Instance Learning algorithm tailored to the Neyman-Pearson problem. We provide a thorough experimental evaluation on a large corpus of network communications collected from various company network environments.

Machine Learning
Authors:
Samaneh Khoshrou, INESC TEC
Jaime dos Santos Cardoso, INESC TEC
Luís Filipe Teixeira, Faculdade de Engenharia da Universidade do Porto (FEUP)

Abstract:
Nowadays, video surveillance systems are taking the first steps toward automation, in order to ease the burden on human resources as well as to avoid human error. As the underlying data distribution and the number of concepts change over time, conventional learning algorithms fail to provide reliable solutions for this setting. Herein, we formalize a learning concept suitable for multi-camera video surveillance and propose a learning methodology adapted to that new paradigm. The proposed framework resorts to the universal background model to robustly learn individual object models from small samples and to more effectively detect novel classes. The individual models are incrementally updated in an ensemble-based approach, with older models being progressively forgotten. The framework is designed to detect and label new concepts automatically. The system is also designed to exploit active learning strategies, interacting wisely with the operator by requesting assistance on the observations that are most ambiguous to classify. The experimental results obtained on both real and synthetic data sets verify the usefulness of the proposed approach.

Scientific Track
Authors:
Guillaume Cleuziou, LIFO Université d'Orléans
Gaël Dias, GREYC

Abstract:
In this paper, we propose a new methodology for the semi-supervised acquisition of lexical taxonomies. Our approach is based on the theory of pretopology, which offers a powerful formalism to model semantic relations and transforms a list of terms into a structured term space by combining different discriminant criteria. In order to learn a parameterized pretopological space, we define the Learning Pretopological Spaces strategy based on genetic algorithms. In particular, rare but accurate pieces of knowledge are used to parameterize the different criteria defining the pretopological term space. Then, a structuring algorithm is used to transform the pretopological space into a lexical taxonomy. Results on three standard datasets show improved performance against state-of-the-art associative and pattern-based approaches.

Machine Learning
Rollover 2014
Authors:
Irma Ravkic, KU Leuven, Department of Computer Science
Jan Ramon, KU Leuven, Department of Computer Science
Jesse Davis, KU Leuven, Department of Computer Science

Abstract:
Statistical Relational Learning (SRL) is concerned with developing formalisms for representing and learning from data that exhibit both uncertainty and complex, relational structure. Most of the work in SRL has focused on modeling and learning from data that only contain discrete variables. As many important problems are characterized by the presence of both continuous and discrete variables, there has been a growing interest in developing hybrid SRL formalisms. Most of these formalisms focus on reasoning and representational issues and, in some cases, parameter learning. What has received little attention is learning the structure of a hybrid SRL model from data. In this paper, we fill that gap and make the following contributions. First, we propose Hybrid Relational Dependency Networks (HRDNs), an extension to Relational Dependency Networks that is able to model continuous variables. Second, we propose an algorithm for learning both the structure and the parameters of an HRDN from data. Third, we provide an empirical evaluation that demonstrates that explicitly modeling continuous variables results in more accurate learned models than discretizing them prior to learning.

Nectar Track
Authors:
Markus Schedl, Johannes Kepler University

Abstract:
Music recommender systems have lately seen a sharp increase in popularity due to many novel commercial music streaming services. Most systems, however, do not adequately take their listeners into account when recommending music items. In this note, we summarize our recent work and report our latest findings on the topics of tailoring music recommendations to individual listeners and to groups of listeners sharing certain characteristics. We focus on two tasks: context-aware automatic playlist generation (also known as serial recommendation) using sensor data, and music artist recommendation using social media data.

Nectar Track
Authors:

Stefano Ferilli, Floriana Esposito, Domenico Redavid

Affiliation(s):
University of Bari Aldo Moro

Abstract:
Manually building process models is complex, costly and error-prone; hence the interest in process mining. Incremental adaptation of the models, and the ability to express and learn complex conditions on the involved tasks, are also desirable. First-order logic provides a single comprehensive and powerful framework for supporting all of the above. This paper presents a first-order logic incremental method for inferring process models. Its efficiency and effectiveness were demonstrated in both controlled experiments and on a real-world dataset.

Scientific Track
Authors:
Rohit Kumar, Université Libre de Bruxelles
Toon Calders, Université Libre de Bruxelles
Aristides Gionis, Aalto University
Nikolaj Tatti, Aalto University

Abstract:
Large networks are generated by applications that keep track of relationships between different data entities. Examples include online social networks recording interactions between individuals, sensor networks logging information exchanges between sensors, and more. There is a large body of literature on computing exact or approximate properties of large networks, although most methods assume static networks. In most modern real-world applications, however, networks are highly dynamic, and interactions along existing connections are generated continuously. Furthermore, it is desirable to consider that old edges become less important, and that their contribution to the current view of the network diminishes over time. We study the problem of maintaining the neighborhood profile of each node in an interaction network. Maintaining such a profile has applications in modeling network evolution and in monitoring the importance of the nodes of the network over time. We present an online streaming algorithm to maintain neighborhood profiles in the sliding-window model. The algorithm is highly scalable: it permits parallel processing, and the computation is node-centric, so it scales easily to very large networks on a distributed system. We present results from both serial and parallel implementations of the algorithm on different social networks.
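
As a toy illustration of the sliding-window model (not the authors' streaming algorithm, which approximates profiles at scale and over longer horizons), an exact 1-hop neighborhood profile over the last w interactions might be maintained like this:

```python
from collections import deque, defaultdict

class NeighborhoodProfile:
    """Maintain, per node, the set of distinct 1-hop neighbours seen
    in the last w interactions. Old interactions expire as the window
    slides, so stale edges stop contributing to the profile."""
    def __init__(self, w):
        self.w = w
        self.window = deque()                       # interactions in arrival order
        self.counts = defaultdict(lambda: defaultdict(int))

    def add(self, u, v):
        self.window.append((u, v))
        self.counts[u][v] += 1
        self.counts[v][u] += 1
        if len(self.window) > self.w:               # expire the oldest interaction
            a, b = self.window.popleft()
            for x, y in ((a, b), (b, a)):
                self.counts[x][y] -= 1
                if self.counts[x][y] == 0:
                    del self.counts[x][y]

    def profile(self, u):
        return set(self.counts[u])
```

Because each update touches only the two endpoint nodes, the computation is node-centric, which is the property the abstract exploits for parallel processing.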

Scientific Track
Authors:

Konstantinos Sechidis, Gavin Brown

Affiliation(s):
University of Manchester

Abstract:
The importance of Markov blanket discovery algorithms is twofold: as the main building block in constraint-based structure learning of Bayesian network algorithms and as a technique to derive the optimal set of features in filter feature selection approaches. Equally, learning from partially labelled data is a crucial and demanding area of machine learning, and extending techniques from fully to partially supervised scenarios is a challenging problem. While there are many different algorithms to derive the Markov blanket of fully supervised nodes, the partially-labelled problem is far more challenging, and there is a lack of principled approaches in the literature. Our work derives a generalization of the conditional tests of independence for partially labelled binary target variables, which can handle the two main partially labelled scenarios: positive-unlabelled and semi-supervised. The result is a significantly deeper understanding of how to control false negative errors in Markov Blanket discovery procedures and how unlabelled data can help.

Data Mining and Knowledge Discovery
Authors:

Saket Navlakha, Christos Faloutsos, Ziv Bar-Joseph

Affiliation(s):
Machine Learning Department, School of Computer Science, Carnegie Mellon University

Abstract:
Consider networks in harsh environments, where nodes may be lost due to failure, attack, or infection. How is the topology affected by such events? Can we mimic and measure the effect? We propose a new generative model of network evolution in dynamic and harsh environments. Our model can reproduce the range of topologies observed across known robust and fragile biological networks, as well as several additional transport, communication, and social networks. We also develop a new optimization measure to evaluate robustness based on preserving high connectivity following random or adversarial bursty node loss. Using this measure, we evaluate the robustness of several real-world networks and propose a new distributed algorithm to construct secure networks operating within malicious environments.
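
A simple proxy for such a robustness measure (not the paper's exact connectivity-preservation measure) is the fraction of surviving nodes that remain in the largest connected component after node loss:

```python
def largest_cc_fraction(adj, removed):
    """Fraction of surviving nodes in the largest connected component
    after deleting the nodes in `removed`. `adj` maps each node to a
    list of its neighbours."""
    alive = set(adj) - set(removed)
    best, seen = 0, set()
    for s in alive:
        if s in seen:
            continue
        stack, comp = [s], set()        # iterative DFS over surviving nodes
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(m for m in adj[n] if m in alive and m not in comp)
        seen |= comp
        best = max(best, len(comp))
    return best / len(alive) if alive else 0.0
```

Averaging this quantity over many random (or adversarially chosen) removal sets gives one crude robustness score for comparing topologies.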

Scientific Track
Authors:
Wojciech Czarnecki, Jagiellonian University
Rafal Józefowicz, Google
Jacek Tabor, Jagiellonian University

Abstract:
Representation learning is currently a very hot topic in machine learning, mostly due to the great success of deep learning methods. In particular, a low-dimensional representation which discriminates between classes can not only enhance the classification procedure but also make it faster, and, contrary to high-dimensional embeddings, it can be used efficiently for visual exploratory data analysis. In this paper we propose Maximum Entropy Linear Manifold (MELM), a multidimensional generalization of the Multithreshold Entropy Linear Classifier model, which is able to find a low-dimensional linear data projection maximizing the discriminativeness of the projected classes. As a result we obtain a linear embedding which can be used for classification, class-aware dimensionality reduction and data visualization. MELM provides highly discriminative 2D projections of the data which can be used as a method for constructing robust classifiers. We provide both an empirical evaluation and some interesting theoretical properties of our objective function, such as scale and affine transformation invariance, connections with PCA, and a bound on the expected balanced accuracy error.

Scientific Track
Authors:
Christian Knoll, Graz University of Technology
Michael Rath, Graz University of Technology
Sebastian Tschiatschek, ETH Zurich
Franz Pernkopf, Graz University of Technology

Abstract:
Approximate inference in large and densely connected graphical models is a challenging but highly relevant problem. Belief propagation, as a method for performing approximate inference in loopy graphs, has shown empirical success in many applications. However, convergence of belief propagation can only be guaranteed for simple graphs. Whether belief propagation converges depends strongly on the applied message update scheme, and specialized schemes can be highly beneficial. Yet, residual belief propagation is the only established method utilizing this fact to improve convergence properties. In experiments, we observe that residual belief propagation fails to converge if local oscillations occur and the same sequence of messages is repeatedly updated. To overcome this issue, we propose two novel message update schemes. In the first scheme we add noise to oscillating messages. In the second scheme we apply weight decay to gradually reduce the influence of these messages and consequently enforce convergence. Furthermore, in contrast to previous work, we consider the correctness of the obtained marginals and observe significant performance improvements when applying the proposed message update schemes to various Ising models with binary random variables.

Machine Learning
Authors:

Parthan Kasarapu, Lloyd Allison

Affiliation(s):
Monash University, Australia

Abstract:
Mixture modelling involves explaining some observed evidence using a combination of probability distributions. The crux of the problem is the inference of an optimal number of mixture components and their corresponding parameters. This paper discusses unsupervised learning of mixture models using the Bayesian Minimum Message Length (MML) criterion. To demonstrate the effectiveness of search and inference of mixture parameters using the proposed approach, we select two key probability distributions, each handling fundamentally different types of data: the multivariate Gaussian distribution, to address mixture modelling of data distributed in Euclidean space, and the multivariate von Mises-Fisher (vMF) distribution, to address mixture modelling of directional data distributed on a unit hypersphere. The key contributions of this paper, in addition to the general search and inference methodology, include the derivation of MML expressions for encoding the data using multivariate Gaussian and von Mises-Fisher distributions, and the analytical derivation of the MML estimates of the parameters of the two distributions. Our approach is tested on simulated and real-world data sets. For instance, we infer vMF mixtures that concisely explain experimentally determined three-dimensional protein conformations, providing an effective null model description of protein structures that is central to many inference problems in structural bioinformatics. The experimental results demonstrate that the performance of our proposed search and inference method, along with the encoding schemes, improves on state-of-the-art mixture modelling techniques.

Data Mining and Knowledge Discovery
Rollover 2014
Authors:
Lei Duan, Sichuan University
Guanting Tang, Simon Fraser University, Canada
Jian Pei, Simon Fraser University, Canada
James Bailey, The University of Melbourne, Australia
Akiko Campbell, Pacific Blue Cross, Canada
Changjie Tang, Sichuan University

Abstract:
When we are investigating an object in a data set, which itself may or may not be an outlier, can we identify unusual (i.e., outlying) aspects of the object? In this paper, we identify the novel problem of mining outlying aspects on numeric data. Given a query object o in a multidimensional numeric data set O, in which subspace is o most outlying? Technically, we use the rank of the probability density of an object in a subspace to measure the outlyingness of the object in the subspace. A minimal subspace where the query object is ranked the best is an outlying aspect. Computing the outlying aspects of a query object is far from trivial. A naïve method has to calculate the probability densities of all objects and rank them in every subspace, which is very costly when the dimensionality is high. We systematically develop a heuristic method that is capable of searching data sets with tens of dimensions efficiently. Our empirical study using both real data and synthetic data demonstrates that our method is effective and efficient.
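
The core definition can be illustrated with the naive, exhaustive search that the paper's heuristic is designed to avoid: rank the query's kernel density estimate among all objects within every low-dimensional subspace, and keep the subspace where the query ranks lowest. A sketch, with an illustrative Gaussian kernel and fixed bandwidth:

```python
import numpy as np
from itertools import combinations

def density_rank(X, q_idx, dims, bandwidth=1.0):
    """Rank (1 = lowest) of the query object's Gaussian-kernel density
    estimate among all objects, restricted to the subspace `dims`."""
    Z = X[:, list(dims)]
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    dens = np.exp(-d2 / (2 * bandwidth ** 2)).sum(1)
    return int(np.sum(dens < dens[q_idx])) + 1

def most_outlying_subspace(X, q_idx, max_dim=2):
    """Naive search over every subspace up to max_dim, keeping the one
    where the query ranks lowest in density. This enumeration is
    exponential in the dimensionality, which is exactly the cost the
    paper's heuristic method avoids."""
    best = None
    for k in range(1, max_dim + 1):
        for dims in combinations(range(X.shape[1]), k):
            r = density_rank(X, q_idx, dims)
            if best is None or r < best[0]:
                best = (r, dims)
    return best
```

On a toy set where the query is extreme in dimension 0 but unremarkable in dimension 1, the search correctly reports dimension 0 as the outlying aspect.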

Nectar Track
Authors:
Berlingerio Michele, IBM Research Ireland
Veli Bicer, IBM Research Ireland
Adi Botea, IBM Research Ireland
Stefano Braghin, IBM Research Ireland
Nuno Lopes, IBM Research Ireland
Riccardo Guidotti, KDD LAB - Univ. Pisa & ISTI-CNR
Francesca Pratesi, KDD LAB - Univ. Pisa

Abstract:
Smart Cities applications are fostering research in many fields, including Computer Science and Engineering. Data Mining is used to support applications such as optimization of a public urban transit network, carpooling, event detection [2], and many more. Along these lines, the aim of the PErsonal TRansport Advisor (PETRA) EU FP7 project is to develop an integrated platform that supplies urban travelers with smart journey and activity advice, on a multi-modal network, while taking uncertainty into account. Uncertainty is intrinsic to a transit network, and may come in different forms: delays in times of arrival, the impossibility of boarding a (full) bus, walking speed, as well as incidents, weather conditions, and so on. In this paper, we briefly describe the architecture of the PETRA platform, and present the results obtained by the embedded journey planner on thousands of planning requests, performed with and without the results coming from the Mobility Mining module. We show how, by integrating private transport routines into a public transit network, it is possible to provide better advice, measured both in terms of the number of requests satisfied and in terms of expected times of arrival. These experiments are part of the validation of the PETRA use case in Rome, where we assess the quality of the advice coming from the innovative integrated platform.

Demo
Authors:

Ujval Kamath, Ana Costa e Silva, Michael O'Connell

Affiliation(s):
TIBCO Software Inc.

Abstract:
Within the context of remote equipment monitoring in an upstream oil and gas process, the goal of this project was to improve the efficiency of ESP (Electric Submersible Pump) oil and gas production by predicting (rather than just reacting to) ESP shutdown and failure, thus avoiding downtime that results in lost production as well as repair costs. A methodology and solution for real-time monitoring of production equipment in remote locations is presented. The solution is developed on sensor data transmitted from the equipment to field information systems. It uses a combination of TIBCO Spotfire, StreamBase, and TERR (TIBCO Enterprise Runtime for R), and performs remarkably well, identifying a variety of anomalous equipment behaviour states and preventing multiple shutdowns and pump failures, with false positive rates close to zero.

Scientific Track
Authors:
Niloofar Yousefi, University of Central Florida
Michael Georgiopoulos, UCF
Georgios Anagnostopoulos, Florida Institute of Technology

Abstract:
When faced with learning a set of inter-related tasks from a limited amount of usable data, learning each task independently may lead to poor generalization performance. Multi-Task Learning (MTL) exploits the latent relations between tasks and overcomes data scarcity limitations by co-learning all these tasks simultaneously to offer improved performance. We propose a novel Multi-Task Multiple Kernel Learning framework based on Support Vector Machines for binary classification tasks. By considering pair-wise task affinity in terms of similarity between a pair's respective feature spaces, the new framework, compared to other similar MTL approaches, offers a high degree of flexibility in determining how similar feature spaces should be, as well as which pairs of tasks should share a common feature space in order to benefit overall performance. The associated optimization problem is solved via a block coordinate descent, which employs a consensus-form Alternating Direction Method of Multipliers algorithm to optimize the Multiple Kernel Learning weights and, hence, to determine task affinities. Empirical evaluation on seven data sets exhibits a statistically significant improvement of our framework's results compared to the ones of several other Clustered Multi-Task Learning methods.

Scientific Track
Authors:
Peng Luo, Northwest University, China
Jinye Peng, Northwest University, China
Ziyu Guan, Northwest University, China
Jianping Fan, Northwest University, China

Abstract:
Many real-world datasets are represented by multiple features or modalities which often provide compatible and complementary information to each other. In order to obtain a good data representation that synthesizes multiple features, researchers have proposed different multi-view subspace learning algorithms. Although label information has been exploited for guiding multi-view subspace learning, previous approaches either fail to directly capture the semantic relations between labeled items or unrealistically make a Gaussian assumption about the data distribution. In this paper, we propose a new multi-view nonnegative subspace learning algorithm called Multi-view Semantic Learning (MvSL). MvSL tries to capture the semantic structure of multi-view data through a novel graph embedding framework. The key idea is to let neighboring intra-class items be near each other while keeping the nearest inter-class items away from each other in the learned common subspace across multiple views. This nonparametric scheme can better model non-Gaussian data. To assess nearest neighbors in the multi-view context, we develop a multiple kernel learning method for obtaining an optimal kernel combination from multiple features. In addition, we encourage each latent dimension to be associated with a subset of views via sparseness constraints. In this way, MvSL is able to capture flexible conceptual patterns hidden in multi-view features. Experiments on two real-world datasets demonstrate the effectiveness of the proposed algorithm.

Scientific Track
Authors:
Adolfo Martinez-Uso, Polytechnic University of Valencia
Jose Hernandez-Orallo, Polytechnic University of Valencia

Abstract:
Multidimensional data is systematically analysed at multiple granularities by applying aggregate and disaggregate operators (e.g., through the use of OLAP tools). For instance, in a supermarket we may want to predict sales of tomatoes for next week, but we may also be interested in predicting sales for all vegetables (higher up in the product hierarchy) for next Friday (lower down in the time dimension). While the domain and data are the same, the operating context is different. We explore several approaches for multidimensional data when predictions have to be made at different levels (or contexts) of aggregation. One method relies on the same resolution, another aggregates predictions bottom-up, a third disaggregates predictions top-down, and a final technique corrects predictions using the relation between levels. We show how these strategies behave when the resolution context changes, using several machine learning techniques in four application domains.
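
The bottom-up and top-down strategies are straightforward to sketch (illustrative names and data; the same-resolution and correction-based strategies are omitted):

```python
def bottom_up(leaf_preds, hierarchy):
    """Aggregate leaf-level predictions to each parent by summing its
    children (e.g., per-product sales up to the category level)."""
    return {parent: sum(leaf_preds[c] for c in children)
            for parent, children in hierarchy.items()}

def top_down(parent_pred, shares):
    """Disaggregate a parent-level prediction among children using
    historical proportions (shares) of each child."""
    total = sum(shares.values())
    return {c: parent_pred * s / total for c, s in shares.items()}
```

For example, predicted tomato and cucumber sales sum to a vegetables-level forecast, while a vegetables-level forecast can be split back down in proportion to each product's historical share.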

Demo
Authors:

Matija Piškorec, Borut Sluban, Tomislav Šmuc

Affiliation(s):
Ruđer Bošković Institute
Jozef Stefan Institute

Abstract:
This paper presents MultiNets: a JavaScript library for multilayer network visualization. MultiNets provides reusable HTML components with functions for loading, manipulating and visualizing multilayered networks. These components can be easily incorporated into any web page, and they allow users to perform exploratory analysis of multilayer networks and to prepare publication-quality network visualizations. MultiNets components are easily extendable to provide custom visualizations, such as embedding networks on geographical maps, and can be used for building complex web-based graphical user interfaces for data mining services that operate on multilayered networks and multirelational data in general.

Scientific Track
Authors:

Weixiang Shao, Lifang He, Philip Yu

Affiliation(s):
Univ. of Illinois at Chicago

Abstract:
With the advance of technology, data often come with multiple modalities or from multiple sources. Multi-view clustering provides a natural way to generate clusters from such data. Although multi-view clustering has been successfully applied in many applications, most previous methods assume the completeness of each view (i.e., each instance appears in all views). However, in real-world applications it is often the case that a number of views are available for learning but none of them is complete. The incompleteness of all the views and the number of available views make it difficult to integrate all the incomplete views and obtain a better clustering solution. In this paper, we propose MIC (Multi-Incomplete-view Clustering), an algorithm based on weighted nonnegative matrix factorization with L2,1 regularization. MIC works by learning the latent feature matrices for all the views and generating a consensus matrix so that the difference between each view and the consensus is minimized. MIC has several advantages compared with existing methods. First, MIC incorporates weighted nonnegative matrix factorization, which handles the missing instances in each incomplete view. Second, MIC uses a co-regularized approach, which pushes the learned latent feature matrices of all the views towards a common consensus. By regularizing the disagreement between the latent feature matrices and the consensus, MIC can easily be extended to more than two incomplete views. Third, MIC incorporates L2,1 regularization into the weighted nonnegative matrix factorization, which makes it robust to noise and outliers. Fourth, MIC uses an iterative optimization framework, which is scalable and provably convergent. Experiments on real datasets demonstrate the advantages of MIC.
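
The weighted-NMF building block, which handles missing instances via a 0/1 mask, can be sketched with standard multiplicative updates; the co-regularization towards a consensus matrix and the L2,1 term that complete MIC are omitted here:

```python
import numpy as np

def weighted_nmf(X, W, k, iters=200, seed=0):
    """Weighted NMF X ≈ U @ V with a 0/1 mask W marking observed
    entries (missing instances get zero weight), via multiplicative
    updates. MIC additionally ties each view's factors to a shared
    consensus with L2,1 regularization; this sketch shows only the
    masking that copes with incomplete views."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = rng.random((n, k)) + 0.1
    V = rng.random((k, m)) + 0.1
    for _ in range(iters):
        WX = W * X
        U *= (WX @ V.T) / ((W * (U @ V)) @ V.T + 1e-9)
        V *= (U.T @ WX) / (U.T @ (W * (U @ V)) + 1e-9)
    return U, V
```

On rank-1 data with one masked entry, the observed entries constrain the factors enough that the missing value is effectively imputed by the reconstruction.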