Recently, graph neural networks (GNNs) have revolutionized the field of graph representation learning through effectively learned node embeddings, and achieved state-of-the-art results in tasks such as node classification and link prediction. However, current GNN methods are inherently flat and do not learn hierarchical representations of graphs---a limitation that is especially problematic for the task of graph classification, where the goal is to predict the label associated with an entire graph. Here we propose DiffPool, a differentiable graph pooling module that can generate hierarchical representations of graphs and can be combined with various graph neural network architectures in an end-to-end fashion. DiffPool learns a differentiable soft cluster assignment for nodes at each layer of a deep GNN, mapping nodes to a set of clusters, which then form the coarsened input for the next GNN layer. Our experimental results show that combining existing GNN methods with DiffPool yields an average improvement of 5-10% accuracy on graph classification benchmarks, compared to all existing pooling approaches, achieving a new state-of-the-art on four out of five benchmark data sets.
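The coarsening step described in this abstract can be sketched in plain NumPy: a learned soft assignment matrix S maps N nodes to K clusters and pools both the features and the adjacency. The shapes and the softmax over cluster scores follow the description above; the random inputs stand in for outputs of a GNN layer, and the function names are illustrative, not the authors' API.

```python
import numpy as np

def diffpool_coarsen(X, A, S_logits):
    """One soft-pooling coarsening step.

    X: (N, d) node features, A: (N, N) adjacency,
    S_logits: (N, K) unnormalized cluster scores (e.g. from a GNN).
    Returns coarsened features (K, d) and adjacency (K, K).
    """
    # Soft cluster assignment: each row of S is a distribution over K clusters.
    S = np.exp(S_logits - S_logits.max(axis=1, keepdims=True))
    S /= S.sum(axis=1, keepdims=True)
    X_coarse = S.T @ X        # (K, d): cluster features
    A_coarse = S.T @ A @ S    # (K, K): cluster connectivity
    return X_coarse, A_coarse

rng = np.random.default_rng(0)
N, d, K = 6, 4, 2
X = rng.normal(size=(N, d))
A = (rng.random((N, N)) < 0.5).astype(float)
Xc, Ac = diffpool_coarsen(X, A, rng.normal(size=(N, K)))
print(Xc.shape, Ac.shape)  # (2, 4) (2, 2)
```

Because the assignment is a differentiable softmax rather than a hard clustering, gradients flow through the pooling step, which is what lets the module train end-to-end with the surrounding GNN layers.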

Representation learning provides new and powerful graph analytical approaches and tools for the highly valued data science challenge of mining knowledge graphs. Since previous graph analytical methods have mostly focused on homogeneous graphs, an important current challenge is extending this methodology to richly heterogeneous graphs and knowledge domains. The biomedical sciences are such a domain, reflecting the complexity of biology, with entities such as genes, proteins, drugs, diseases, and phenotypes, and relationships such as gene co-expression, biochemical regulation, and biomolecular inhibition or activation. Therefore, the semantics of edges and nodes are critical for representation learning and knowledge discovery in real-world biomedical problems. In this paper, we propose the edge2vec model, which represents graphs considering edge semantics. An edge-type transition matrix is trained by an Expectation-Maximization approach, and a stochastic gradient descent model is employed to learn node embeddings on a heterogeneous graph via the trained transition matrix. edge2vec is validated on three biomedical domain tasks: biomedical entity classification, compound-gene bioactivity prediction, and biomedical information retrieval. Results show that, by incorporating edge types into node embedding learning in heterogeneous graphs, edge2vec significantly outperforms state-of-the-art models on all three tasks. We propose this method for its added value relative to existing graph analytical methodology, and for its applicability in the real-world context of biomedical knowledge discovery.

In this paper we describe a deep learning system that has been designed and built for the WASSA 2017 Emotion Intensity Shared Task. We introduce a representation learning approach based on inner attention on top of an RNN. Results show that our model offers good capabilities and is able to successfully identify emotion-bearing words to predict intensity without relying on lexicons, obtaining 13th place among 22 shared task competitors.

We introduce a novel unsupervised domain adaptation approach for object detection. We aim to simultaneously alleviate the imperfect translation problem of pixel-level adaptations and the source-biased discriminativity problem of feature-level adaptations. Our approach is composed of two stages, i.e., Domain Diversification (DD) and Multi-domain-invariant Representation Learning (MRL). At the DD stage, we diversify the distribution of the labeled data by generating various distinctive shifted domains from the source domain. At the MRL stage, we apply adversarial learning with a multi-domain discriminator to encourage features to be indistinguishable among the domains. DD addresses the source-biased discriminativity, while MRL mitigates the imperfect image translation. We construct a structured domain adaptation framework for our learning paradigm and introduce a practical way of implementing DD. Our method outperforms the state-of-the-art methods by a large margin of 3%~11% in terms of mean average precision (mAP) on various datasets.

Text segmentation plays an important role in various Natural Language Processing (NLP) tasks like summarization, context understanding, document indexing and document noise removal. Previous methods for this task require manual feature engineering, have huge memory requirements and long execution times. To the best of our knowledge, this paper is the first to present a novel supervised neural approach for text segmentation. Specifically, we propose an attention-based bidirectional LSTM model where sentence embeddings are learned using CNNs and the segments are predicted based on contextual information. This model can automatically handle variable-sized context information. Compared to the existing competitive baselines, the proposed model shows a performance improvement of ~7% in WinDiff score on three benchmark datasets.

Visual relations, such as "person ride bike" and "bike next to car", offer a comprehensive scene understanding of an image, and have already shown their great utility in connecting computer vision and natural language. However, due to the challenging combinatorial complexity of modeling subject-predicate-object relation triplets, very little work has been done to localize and predict visual relations. Inspired by the recent advances in relational representation learning of knowledge bases and convolutional object detection networks, we propose a Visual Translation Embedding network (VTransE) for visual relation detection. VTransE places objects in a low-dimensional relation space where a relation can be modeled as a simple vector translation, i.e., subject + predicate $\approx$ object. We propose a novel feature extraction layer that enables object-relation knowledge transfer in a fully-convolutional fashion and supports training and inference in a single forward/backward pass. To the best of our knowledge, VTransE is the first end-to-end relation detection network. We demonstrate the effectiveness of VTransE over other state-of-the-art methods on two large-scale datasets: Visual Relationship and Visual Genome. Note that even though VTransE is a purely visual model, it is still competitive with Lu et al.'s multi-modal model with language priors.
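The translation principle subject + predicate ≈ object can be illustrated with a tiny scoring function: a triplet is plausible when the translated subject lands near the object. The 3-d vectors below are hand-picked toy values, not learned embeddings, and the scoring norm is the usual choice in translation-embedding models rather than anything specific to VTransE.

```python
import numpy as np

def relation_score(s, p, o):
    """Translation-embedding score: lower means the triplet
    (subject, predicate, object) is more plausible, following
    subject + predicate ≈ object."""
    return np.linalg.norm(s + p - o)

# Toy 3-d embeddings (illustrative values, not learned).
person = np.array([1.0, 0.0, 0.0])
bike   = np.array([1.0, 1.0, 0.0])
ride   = np.array([0.0, 1.0, 0.0])   # chosen so that "ride" ≈ bike - person

good = relation_score(person, ride, bike)   # (person, ride, bike): plausible
bad  = relation_score(bike, ride, person)   # (bike, ride, person): implausible
assert good < bad
```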

In this paper we consider two sequence tagging tasks for medieval Latin: part-of-speech tagging and lemmatization. These are both basic, yet foundational preprocessing steps in applications such as text re-use detection. Nevertheless, they are generally complicated by the considerable orthographic variation which is typical of medieval Latin. In Digital Classics, these tasks are traditionally solved in a (i) cascaded and (ii) lexicon-dependent fashion. For example, a lexicon is used to generate all the potential lemma-tag pairs for a token, and next, a context-aware PoS-tagger is used to select the most appropriate tag-lemma pair. Apart from the problems with out-of-lexicon items, error percolation is a major downside of such approaches. In this paper we explore the possibility of elegantly solving these tasks using a single, integrated approach. For this, we make use of a layered neural network architecture from the field of deep representation learning.

It has been shown that for automated PAP-smear image classification, nucleus features can be very informative. Therefore, the primary step for automated screening can be cell-nuclei detection followed by segmentation of nuclei in the resulting single-cell PAP-smear images. We propose a patch-based approach using a CNN for segmentation of nuclei in single-cell images. We then pose the question of whether segmentation is necessary for classification using representation learning with CNNs, and whether low-level CNN features may be useful for classification. We suggest a CNN-based feature-level analysis and a transfer learning based approach for classification using both segmented as well as full single-cell images. We also propose a decision-tree based approach for classification. Experimental results demonstrate the effectiveness of the proposed algorithms individually (with low-level CNN features), and simultaneously prove the sufficiency of cell-nuclei detection (rather than accurate segmentation) for classification. Thus, we propose a system for analysis of multi-cell PAP-smear images consisting of a simple nuclei detection algorithm followed by classification using transfer learning.

Deep RL approaches build much of their success on the ability of the deep neural network to generate useful internal representations. Nevertheless, they suffer from high sample complexity, and starting with a good input representation can have a significant impact on performance. In this paper, we exploit the fact that the underlying Markov decision process (MDP) represents a graph, which enables us to incorporate topological information for effective state representation learning. Motivated by the recent success of node representations for several graph analytical tasks, we specifically investigate the capability of node representation learning methods to effectively encode the topology of the underlying MDP in Deep RL. To this end we perform a comparative analysis of several models chosen from 4 different classes of representation learning algorithms for policy learning in grid-world navigation tasks, which are representative of a large class of RL problems. We find that all embedding methods outperform the commonly used matrix representation of grid-world environments in all of the studied cases. Moreover, graph convolution based methods are outperformed by simpler random walk based methods and graph linear autoencoders.

The analysis of mixed data has been raising challenges in statistics and machine learning. One of the two most prominent challenges is to develop new statistical techniques and methodologies to effectively handle mixed data by making the data less heterogeneous with minimum loss of information. The other challenge is that such methods must be applicable to large-scale tasks when dealing with huge amounts of mixed data. To tackle these challenges, we introduce parameter sharing and balancing extensions to our recent model, the mixed-variate restricted Boltzmann machine (MV.RBM), which can transform heterogeneous data into a homogeneous representation. We also integrate structured sparsity and distance metric learning into RBM-based models. Our proposed methods are applied in various applications including latent patient profile modelling in medical data analysis and representation learning for image retrieval. The experimental results demonstrate that the models perform better than baseline methods on medical data and outperform state-of-the-art rivals on the image dataset.

Knowledge graph embedding methods learn continuous vector representations for entities in knowledge graphs and have been used successfully in a large number of applications. We present a novel and scalable paradigm for the computation of knowledge graph embeddings, which we dub PYKE. Our approach combines a physical model based on Hooke's law and its inverse with ideas from simulated annealing to compute embeddings for knowledge graphs efficiently. We prove that PYKE achieves a linear space complexity. While the time complexity for the initialization of our approach is quadratic, the time complexity of each of its iterations is linear in the size of the input knowledge graph. Hence, PYKE's overall runtime is close to linear. Consequently, our approach easily scales up to knowledge graphs containing millions of triples. We evaluate our approach against six state-of-the-art embedding approaches on the DrugBank and DBpedia datasets in two series of experiments. The first series shows that the cluster purity achieved by PYKE is up to 26% (absolute) better than that of the state of the art. In addition, PYKE is more than 22 times faster than existing embedding solutions in the best case. The results of our second series of experiments show that PYKE is up to 23% (absolute) better than the state of the art on the task of type prediction while maintaining its superior scalability. Our implementation and results are open-source and are available at this http URL.
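The physical intuition behind a Hooke's-law embedding model can be illustrated with a toy update: related entities attract with a spring-like force proportional to their distance, unrelated ones are pushed apart. This is an illustrative sketch of the general force-directed idea only, with a made-up step size `k`; it is not the actual PYKE update rule.

```python
import numpy as np

def spring_step(E, pos_pairs, neg_pairs, k=0.1):
    """One toy force-directed embedding update: connected entities
    attract with a Hooke's-law-style force proportional to distance,
    unrelated entities repel. Illustrative sketch only."""
    E = E.copy()
    for i, j in pos_pairs:              # attraction proportional to distance
        delta = E[j] - E[i]
        E[i] += k * delta
        E[j] -= k * delta
    for i, j in neg_pairs:              # repulsion: push the pair apart
        delta = E[j] - E[i]
        E[i] -= k * delta
        E[j] += k * delta
    return E

rng = np.random.default_rng(5)
E = rng.normal(size=(4, 2))             # 4 entities, 2-d embeddings
d_before = np.linalg.norm(E[0] - E[1])
E2 = spring_step(E, pos_pairs=[(0, 1)], neg_pairs=[(2, 3)])
assert np.linalg.norm(E2[0] - E2[1]) < d_before   # related pair moved closer
```

Iterating such updates while decaying `k` is where the simulated-annealing flavor mentioned in the abstract would come in.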

As a prominent research topic in the multimedia area, cross-media retrieval aims to capture the complex correlations among multiple media types. Learning better shared representations and distance metrics for multimedia data is important to boost cross-media retrieval. Motivated by the strong ability of deep neural networks in feature representation and comparison-function learning, we propose the Unified Network for Cross-media Similarity Metric (UNCSM) to associate cross-media shared representation learning with distance metric learning in a unified framework. First, we design a two-pathway deep network pretrained with contrastive loss, and employ a double triplet similarity loss for fine-tuning to learn the shared representation for each media type by modeling the relative semantic similarity. Second, the metric network is designed for effectively calculating the cross-media similarity of the shared representation, by modeling the pairwise similar and dissimilar constraints. Compared to the existing methods, which mostly ignore the dissimilar constraints and only use a simple distance metric such as Euclidean distance separately, our UNCSM approach unifies representation learning and the distance metric to preserve the relative similarity as well as embrace more complex similarity functions, further improving cross-media retrieval accuracy. The experimental results show that our UNCSM approach outperforms 8 state-of-the-art methods on 4 widely-used cross-media datasets.

An important problem in multiview representation learning is finding the optimal combination of views with respect to the specific task at hand. To this end, we introduce NAM: a Neural Attentive Multiview machine that learns multiview item representations and similarity by employing a novel attention mechanism. NAM harnesses multiple information sources and automatically quantifies their relevancy with respect to a supervised task. Finally, a very practical advantage of NAM is its robustness to datasets with missing views. We demonstrate the effectiveness of NAM for the tasks of movie and app recommendation. Our evaluations indicate that NAM outperforms single-view models as well as alternative multiview methods on item recommendation tasks, including cold-start scenarios.

Graph convolutional networks (GCNs) are potentially short of the ability to learn hierarchical representations for graph embedding, which holds them back in the graph classification task. Here, we propose AttPool, a novel graph pooling module based on an attention mechanism, to remedy the problem.

The (variational) graph auto-encoder and its variants have been popularly used for representation learning on graph-structured data. While the encoder is often a powerful graph convolutional network, the decoder reconstructs the graph structure by only considering two nodes at a time, thus ignoring possible interactions among edges. On the other hand, structured prediction, which considers the whole graph simultaneously, is computationally expensive. In this paper, we utilize the well-known triadic closure property which is exhibited in many real-world networks. We propose the triad decoder, which considers and predicts the three edges involved in a local triad together. The triad decoder can be readily used in any graph-based auto-encoder. In particular, we incorporate it into the (variational) graph auto-encoder. Experiments on link prediction, node clustering and graph generation show that the use of triads leads to more accurate prediction, clustering and better preservation of the graph characteristics.

Genetic Programming (GP) is an evolutionary algorithm commonly used for machine learning tasks. In this paper we present a method that allows GP to transform the representation of a large-scale machine learning dataset into a more compact representation, by means of processing features from the original representation at the individual level. As a proof of concept of this method, we develop an autoencoder. We tested a preliminary version of our approach on a variety of well-known machine learning image datasets. We speculate that this method, used in an iterative manner, can produce results competitive with state-of-the-art deep neural networks.

This work proposes a novel algorithm to generate natural language adversarial input for text classification models, in order to investigate the robustness of these models. It involves applying gradient-based perturbation on the sentence embeddings that are used as the features for the classifier, and learning a decoder for generation. We apply this method to a sentiment analysis model and verify its effectiveness in inducing incorrect predictions by the model. We also conduct quantitative and qualitative analysis on these examples and demonstrate that our approach can generate more natural adversaries. In addition, it can be used to successfully perform black-box attacks, which involve attacking other existing models whose parameters are not known. On a public sentiment analysis API, the proposed method introduces a 20% relative decrease in average accuracy and a 74% relative increase in absolute error.

3D multi-object generative models allow us to synthesize a large range of novel 3D multi-object scenes and also identify objects, shapes, layouts and their positions. But multi-object scenes are difficult to create because the dataset is multimodal in nature. Conventional 3D generative adversarial models are not efficient at generating multi-object scenes; they usually tend to generate either one object or fuzzy results of multiple objects. Auto-encoder models have great potential for feature extraction and representation learning using the unsupervised paradigm in probabilistic spaces. We make use of this property in our proposed model. In this paper we propose a novel architecture using 3DConvNets trained with the progressive training paradigm that is able to generate realistic high-resolution 3D scenes of rooms, bedrooms, offices etc. with various pieces of furniture and objects. We make use of the adversarial auto-encoder along with the WGAN-GP loss term in our discriminator loss function. Finally, this new approach to multi-object scene generation is also able to generate a larger number of objects per scene.

This is a work-in-progress report, which aims to share preliminary results of a novel sequence-to-sequence schema for dependency parsing that relies on a combination of a BiLSTM and two Pointer Networks (Vinyals et al., 2015), in which the final softmax function has been replaced with logistic regression. The two pointer networks co-operate to develop latent syntactic knowledge, by learning the lexical properties of "selection" and the lexical properties of "selectability", respectively. At the moment and without fine-tuning, the parser implementation gets a UAS of 93.14% on the English Penn Treebank (Marcus et al., 1993) annotated with Stanford Dependencies: 2-3% under the SOTA but still attractive as a baseline for the approach.

In this paper, we present a novel method for no-reference image quality assessment (NR-IQA), which is to predict the perceptual quality score of a given image without using any reference image. The proposed method harnesses three mechanisms: (i) the visual attention mechanism, which affects many aspects of visual perception including image quality assessment, but is overlooked in the NR-IQA literature; the method assumes that the fixation areas on an image contain key information for the process of IQA; (ii) the robust averaging strategy, which is a means, supported by psychology studies, of integrating multiple/step-wise evidence to make a final perceptual judgment; and (iii) multi-task learning, which is believed to be an effectual means of shaping representation learning and could result in a more generalized model. To exploit the synergy of the three, we consider NR-IQA as a dynamic perception process, in which the model samples a sequence of "informative" areas and aggregates the information to learn a representation for the tasks of jointly predicting the image quality score and the distortion type. The model learning is implemented by a reinforcement strategy, in which the rewards of both tasks guide the learning of the optimal sampling policy to acquire the "task-informative" image regions so that the predictions can be made accurately and efficiently (in terms of the sampling steps). The reinforcement learning is realized by a deep network with the policy gradient method and trained through back-propagation. In experiments, the model is tested on the TID2008 dataset and it outperforms several state-of-the-art methods. Furthermore, the model is very efficient in the sense that a small number of fixations are used in NR-IQA.

The vast majority of visual animals actively control their eyes, heads, and/or bodies to direct their gaze toward different parts of their environment. In contrast, recent applications of reinforcement learning in robotic manipulation employ cameras as passive sensors. These are carefully placed to view a scene from a fixed pose. Active perception allows animals to gather the most relevant information about the world and focus their computational resources where needed. It also enables them to view objects from different distances and viewpoints, providing a rich visual experience from which to learn abstract representations of the environment. Inspired by the primate visual-motor system, we present a framework that leverages the benefits of active perception to accomplish manipulation tasks. Our agent uses viewpoint changes to localize objects, to learn state representations in a self-supervised manner, and to perform goal-directed actions. We apply our model to a simulated grasping task with a 6-DoF action space. Compared to its passive, fixed-camera counterpart, the active model achieves 8% better performance in targeted grasping. Compared to vanilla deep Q-learning algorithms, our model is at least four times more sample-efficient, highlighting the benefits of both active perception and representation learning.

In this paper we consider self-supervised representation learning to improve sample efficiency in reinforcement learning (RL). We propose a forward prediction objective for simultaneously learning embeddings of states and action sequences. These embeddings capture the structure of the environment's dynamics, enabling efficient policy learning. We demonstrate that our action embeddings alone improve the sample efficiency and peak performance of model-free RL on control from low-dimensional states. By combining state and action embeddings, we achieve efficient learning of high-quality policies on goal-conditioned continuous control from pixel observations in only 1-2 million environment steps.

Cross-lingual word vectors are typically obtained by fitting an orthogonal matrix that maps the entries of a bilingual dictionary from a source to a target vector space. Word vectors, however, are most commonly used for sentence or document-level representations that are calculated as the weighted average of word embeddings. In this paper, we propose an alternative to word-level mapping that better reflects sentence-level cross-lingual similarity. We incorporate context in the transformation matrix by directly mapping the averaged embeddings of aligned sentences in a parallel corpus. We also implement cross-lingual mapping of deep contextualized word embeddings using parallel sentences with word alignments. In our experiments, both approaches resulted in cross-lingual sentence embeddings that outperformed context-independent word mapping in sentence translation retrieval. Furthermore, the sentence-level transformation could be used for word-level mapping without loss in word translation quality.
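Fitting the orthogonal matrix described above is the classic Procrustes problem, solved in closed form via SVD. The sketch below applies it to stand-in "averaged sentence embeddings"; the synthetic data is constructed so the true mapping is recoverable, which real parallel corpora of course only approximate.

```python
import numpy as np

def orthogonal_map(X_src, X_tgt):
    """Fit an orthogonal matrix W minimizing ||X_src @ W - X_tgt||_F
    (the Procrustes solution via SVD), as commonly used for
    cross-lingual embedding mapping."""
    U, _, Vt = np.linalg.svd(X_src.T @ X_tgt)
    return U @ Vt

rng = np.random.default_rng(1)
# Stand-ins for averaged embeddings of aligned sentence pairs.
W_true, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # ground-truth rotation
X_src = rng.normal(size=(100, 5))
X_tgt = X_src @ W_true
W = orthogonal_map(X_src, X_tgt)
assert np.allclose(X_src @ W, X_tgt, atol=1e-8)     # mapping recovered
```

The only difference between word-level and sentence-level mapping in this formulation is what the rows of `X_src`/`X_tgt` are: dictionary word vectors versus averaged embeddings of aligned sentences.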

Recent improvements in Generative Adversarial Neural Networks (GANs) have shown their ability to generate higher quality samples as well as to learn good representations for transfer learning. Most of the representation learning methods based on GANs learn representations ignoring their post-use scenario, which can lead to increased generalisation ability. However, the model can become redundant if it is intended for a specific task. For example, assume we have a vast unlabelled audio dataset, and we want to learn a representation from this dataset so that it can be used to improve the emotion recognition performance on a small labelled audio dataset. During the representation learning training, if the model does not know about the post emotion recognition task, it can completely ignore emotion-related characteristics in the learnt representation. This is a fundamental challenge for any unsupervised representation learning model. In this paper, we aim to address this challenge by proposing a novel GAN framework: Guided Generative Neural Network (GGAN), which guides a GAN to focus on learning desired representations and generating superior quality samples for audio data, leveraging fewer labelled samples. Experimental results show that, using a very small amount of labelled data as guidance, a GGAN learns significantly better representations.

The past decade has witnessed the rapid development of feature representation learning and distance metric learning, whereas the two steps are often discussed separately. To explore their interaction, this work proposes an end-to-end learning framework called DARI, i.e. Distance metric And Representation Integration, and validates the effectiveness of DARI in the challenging task of person verification. Given training images annotated with labels, we first produce a large number of triplet units, each of which contains three images, i.e. one person and the matched/mismatched references. For each triplet unit, the distance disparity between the matched pair and the mismatched pair tends to be maximized. We solve this objective by building a deep architecture of convolutional neural networks. In particular, the Mahalanobis distance matrix is naturally factorized as one top fully-connected layer that is seamlessly integrated with the other bottom layers representing the image feature. The image feature and the distance metric can thus be simultaneously optimized via one-shot backward propagation. On several public datasets, DARI shows very promising performance on re-identifying individuals across cameras against various challenges, and outperforms other state-of-the-art approaches.
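The factorization mentioned above rests on a standard identity: a Mahalanobis matrix M = WᵀW turns the Mahalanobis distance into a plain squared Euclidean distance after the linear map W, which is exactly what a fully-connected layer computes. A minimal numerical check, with arbitrary random values standing in for learned weights and features:

```python
import numpy as np

# Mahalanobis distance d_M(x, y) = (x - y)^T M (x - y) with M = W^T W
# equals the squared Euclidean distance after the linear map W -- which is
# why the metric can be realized as a top fully-connected layer.
rng = np.random.default_rng(2)
W = rng.normal(size=(3, 5))      # stand-in for learned layer weights
M = W.T @ W                      # induced Mahalanobis matrix
x, y = rng.normal(size=5), rng.normal(size=5)

d_mahalanobis = (x - y) @ M @ (x - y)
d_euclidean = np.sum((W @ x - W @ y) ** 2)
assert np.isclose(d_mahalanobis, d_euclidean)
```

Because the identity holds for any W, gradients of a triplet loss on the Euclidean side update the metric and the feature layers jointly, matching the "one-shot backward propagation" described in the abstract.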

Recent studies show that deep neural networks are vulnerable to adversarial examples which can be generated via certain types of transformations. Being robust to a desired family of adversarial attacks is then equivalent to being invariant to a family of transformations. Learning invariant representations then naturally emerges as an important goal, which we explore in this paper within specific application contexts. Specifically, we propose a cyclically-trained adversarial network to learn a mapping from image space to latent representation space and back, such that the latent representation is invariant to a specified factor of variation (e.g., identity). The learned mapping ensures that the synthesized image is not only realistic, but has the same values for unspecified factors (e.g., pose and illumination) as the original image and a desired value of the specified factor. Unlike disentangled representation learning, which requires two latent spaces, one for specified and another for unspecified factors, invariant representation learning needs only one such space. We encourage invariance to a specified factor by applying adversarial training using a variational autoencoder in the image space as opposed to the latent space. We strengthen this invariance by introducing a cyclic training process (forward and backward cycle). We also propose a new method to evaluate conditional generative networks. It compares how well different factors of variation can be predicted from the synthesized, as opposed to real, images. In quantitative terms, our approach attains state-of-the-art performance in experiments spanning three datasets with factors such as identity, pose, illumination or style. Our method produces sharp, high-quality synthetic images with few visible artefacts compared to previous approaches.

We implement a method for re-ranking the top-10 results of a state-of-the-art question answering (QA) system. The goal of our re-ranking approach is to improve the answer selection given the user question and the top-10 candidates. We focus on improving deployed QA systems that do not allow re-training or where re-training comes at a high cost. Our re-ranking approach learns a similarity function using n-gram based features from the query, the answer and the initial system confidence as input. Our contributions are: (1) we generate a QA training corpus starting from 877 answers from the customer care domain of T-Mobile Austria, (2) we implement a state-of-the-art QA pipeline using neural sentence embeddings that encode queries in the same space as the answer index, and (3) we evaluate the QA pipeline and our re-ranking approach using a separately provided test set. The test set can be considered to be available after deployment of the system, e.g., based on feedback from users. Our results show that the system performance, in terms of top-n accuracy and mean reciprocal rank, benefits from re-ranking using gradient boosted regression trees. On average, the mean reciprocal rank improves by 9.15%.

BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS). However, they require that both sentences are fed into the network, which causes a massive computational overhead: finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering. In this publication, we present Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy of BERT. We evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where they outperform other state-of-the-art sentence embedding methods.
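The efficiency argument in this abstract boils down to a change of access pattern: with fixed per-sentence embeddings, "most similar pair" becomes n encoder passes plus a cheap similarity matrix, instead of O(n²) cross-encoder passes. A sketch with random vectors standing in for sentence embeddings (the function and the planted near-duplicate are illustrative, not from the paper):

```python
import numpy as np

def most_similar_pair(E):
    """Find the most similar pair among pre-computed sentence embeddings
    via cosine similarity -- n encoder passes instead of O(n^2)
    cross-encoder passes."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ E.T                    # all pairwise cosine similarities
    np.fill_diagonal(sims, -np.inf)   # exclude self-pairs
    i, j = np.unravel_index(np.argmax(sims), sims.shape)
    return (i, j), sims[i, j]

rng = np.random.default_rng(3)
E = rng.normal(size=(10, 8))              # stand-ins for sentence embeddings
E[7] = E[2] + 0.01 * rng.normal(size=8)   # plant a near-duplicate sentence
(i, j), s = most_similar_pair(E)
assert {i, j} == {2, 7}                   # the planted pair is retrieved
```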

In a dynamic network, the neighborhoods of the vertices evolve across different temporal snapshots of the network. Accurate modeling of this temporal evolution can help solve complex tasks involving real-life social and interaction networks. However, existing models for learning latent representations are inadequate for obtaining the representation vectors of the vertices for different time-stamps of a dynamic network in a meaningful way. In this paper, we propose latent representation learning models for dynamic networks which overcome the above limitation by considering two different kinds of temporal smoothness: (i) retrofitted, and (ii) linear transformation. The retrofitted model tracks the representation vector of a vertex over time, facilitating vertex-based temporal analysis of a network. On the other hand, the linear transformation based model provides a smooth transition operator which maps the representation vectors of all vertices from one temporal snapshot to the next (unobserved) snapshot; this facilitates prediction of the state of a network at a future time-stamp. We validate the performance of our proposed models by employing them for solving the temporal link prediction task. Experiments on 9 real-life networks from various domains validate that the proposed models are significantly better than the existing models for predicting the dynamics of an evolving network.
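The transition-operator idea can be sketched with least squares: given vertex embeddings at two consecutive snapshots, fit a single matrix W mapping one to the other, then apply it to forecast the next (unobserved) snapshot. The synthetic data below has an exact linear transition so the fit is perfect; the fitting procedure is a plain least-squares stand-in, not necessarily the paper's training objective.

```python
import numpy as np

def fit_transition(Z_t, Z_next):
    """Least-squares linear operator W mapping vertex embeddings at one
    snapshot to the next: Z_next ≈ Z_t @ W."""
    W, *_ = np.linalg.lstsq(Z_t, Z_next, rcond=None)
    return W

rng = np.random.default_rng(4)
Z_t = rng.normal(size=(50, 6))          # 50 vertices, 6-d embeddings at time t
W_true = rng.normal(size=(6, 6))        # ground-truth transition (synthetic)
Z_next = Z_t @ W_true                   # embeddings at time t+1
W = fit_transition(Z_t, Z_next)
Z_pred = Z_t @ W                        # forecast of the future snapshot
assert np.allclose(Z_pred, Z_next, atol=1e-8)
```

One W for all vertices is what makes this a smoothness assumption: every vertex's embedding is required to evolve under the same global operator.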

To improve the ability of VAEs to disentangle in the latent space, existing works mostly focus on enforcing independence among the learned latent factors. However, the ability of these models to disentangle often decreases as the complexity of the generative factors increases. In this paper, we investigate the little-explored effect of the modeling capacity of a posterior density on the disentangling ability of the VAE. We note that the independence within and the complexity of the latent density are two different properties we constrain when regularizing the posterior density: while the former promotes the disentangling ability of the VAE, the latter -- if overly limited -- creates an unnecessary competition with the data reconstruction objective in the VAE. Therefore, if we preserve the independence but allow richer modeling capacity in the posterior density, we will lift this competition and thereby allow improved independence and data reconstruction at the same time. We investigate this theoretical intuition with a VAE that utilizes a non-parametric latent factor model, the Indian Buffet Process (IBP), as a latent density that is able to grow with the complexity of the data. Across three widely-used benchmark data sets and two clinical data sets little explored for disentangled learning, we qualitatively and quantitatively demonstrate the improved disentangling performance of IBP-VAE over the state of the art. In the latter two clinical data sets, riddled with complex factors of variation, we further demonstrate that unsupervised disentangling of nuisance factors via IBP-VAE -- when combined with a supervised objective -- can not only improve task accuracy in comparison to relevant supervised deep architectures but also facilitate knowledge discovery related to task decision-making. A shorter version of this work will appear in the ICDM 2019 conference proceedings.]]>

In this paper we present a model for unsupervised topic discovery in text corpora. The proposed model uses document, word, and topic lookup-table embeddings as neural network model parameters to build probabilities of words given topics, and probabilities of topics given documents. These probabilities are used to recover, by marginalization, probabilities of words given documents. For very large corpora, where the number of documents can be in the order of billions, using a neural auto-encoder based document embedding is more scalable than using a lookup-table embedding as classically done. We thus extend the lookup-based document embedding model to a continuous auto-encoder based model. Our models are trained using probabilistic latent semantic analysis (PLSA) assumptions. We evaluated our models on six datasets with a rich variety of contents. The conducted experiments demonstrate that the proposed neural topic models are very effective in capturing relevant topics. Furthermore, considering the perplexity metric, the conducted evaluation benchmarks show that our topic models outperform the latent Dirichlet allocation (LDA) model, which is classically used to address topic discovery tasks.]]>
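The PLSA-style marginalization described above, p(w|d) = sum over t of p(w|t) p(t|d), with both conditionals parameterized by embedding dot products and softmax, can be sketched in a few lines. Shapes and the dot-product-softmax parameterization are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
D, T, V, K = 3, 4, 10, 8                  # docs, topics, vocab size, embed dim
doc_emb = rng.normal(size=(D, K))         # lookup-table embeddings (toy values)
topic_emb = rng.normal(size=(T, K))
word_emb = rng.normal(size=(V, K))

p_w_given_t = softmax(topic_emb @ word_emb.T)   # (T, V): words given topics
p_t_given_d = softmax(doc_emb @ topic_emb.T)    # (D, T): topics given docs
p_w_given_d = p_t_given_d @ p_w_given_t         # (D, V): marginalize topics

print(np.allclose(p_w_given_d.sum(axis=1), 1.0))  # each row is a distribution
```

Training would fit the embeddings by maximizing the likelihood of observed document-word counts under p(w|d); swapping the document lookup table for an auto-encoder changes only how `doc_emb` is produced.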

Rapid and massive adoption of mobile/online payment services has brought new challenges to service providers as well as regulators in safeguarding the proper use of such services/systems. In this paper, we leverage recent advances in deep-neural-network-based graph representation learning to detect abnormal/suspicious financial transactions in real-world e-payment networks. In particular, we propose an end-to-end Graph Convolution Network (GCN)-based algorithm to learn the embeddings of the nodes and edges of a large-scale time-evolving graph. In the context of e-payment transaction graphs, the resultant node and edge embeddings can effectively characterize the user background as well as the financial transaction patterns of individual account holders. As such, we can use the graph embedding results to drive downstream graph mining tasks such as node classification to identify illicit accounts within the payment networks. Our algorithm outperforms state-of-the-art schemes including GraphSAGE, Gradient Boosting Decision Tree and Random Forest, delivering considerably higher accuracy (94.62% and 86.98% respectively) in classifying user accounts within two practical e-payment transaction datasets. It also achieves outstanding accuracy (97.43%) for another biomedical entity identification task while using only edge-related information.]]>

Multi-label image and video classification are fundamental yet challenging tasks in computer vision. The main challenges lie in capturing spatial or temporal dependencies between labels and discovering the locations of discriminative features for each class. In order to overcome these challenges, we propose to use cross-modality attention with semantic graph embedding for multi-label classification. Based on the constructed label graph, we propose an adjacency-based similarity graph embedding method to learn semantic label embeddings, which explicitly exploit label relationships. Then our novel cross-modality attention maps are generated with the guidance of the learned label embeddings. Experiments on two multi-label image classification datasets (MS-COCO and NUS-WIDE) show our method outperforms existing state-of-the-art methods. In addition, we validate our method on a large multi-label video classification dataset (YouTube-8M Segments) and the evaluation results demonstrate the generalization capability of our method.]]>

Paragraph Vectors has recently been proposed as an unsupervised method for learning distributed representations for pieces of text. In their work, the authors showed that the method can learn an embedding of movie review texts which can be leveraged for sentiment analysis. That proof of concept, while encouraging, was rather narrow. Here we consider tasks other than sentiment analysis, provide a more thorough comparison of Paragraph Vectors to other document modelling algorithms such as Latent Dirichlet Allocation, and evaluate performance of the method as we vary the dimensionality of the learned representation. We benchmarked the models on two document similarity data sets, one from Wikipedia, one from arXiv. We observe that the Paragraph Vector method performs significantly better than other methods, and propose a simple improvement to enhance embedding quality. Somewhat surprisingly, we also show that much like word embeddings, vector operations on Paragraph Vectors can produce meaningful semantic results.]]>

Recent advancements in graph representation learning have led to the emergence of condensed encodings that capture the main properties of a graph. However, even though these abstract representations are powerful for downstream tasks, they are not equally suitable for visualisation purposes. In this work, we merge Mapper, an algorithm from the field of Topological Data Analysis (TDA), with the expressive power of Graph Neural Networks (GNNs) to produce hierarchical, topologically-grounded visualisations of graphs. These visualisations not only help discern the structure of complex graphs but also provide a means of understanding the models applied to them for solving various tasks. We further demonstrate the suitability of Mapper as a topological framework for graph pooling by mathematically proving an equivalence with Min-Cut and DiffPool. Building upon this framework, we introduce a novel pooling algorithm based on PageRank, which obtains competitive results with state-of-the-art methods on graph classification benchmarks.]]>
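The PageRank scores underlying such a pooling scheme can be computed by standard power iteration. A minimal sketch of PageRank itself (not the paper's pooling algorithm), on a dense adjacency matrix:

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration for PageRank on a dense adjacency matrix."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    out_deg[out_deg == 0] = 1.0            # guard against division by zero
    P = adj / out_deg                      # row-stochastic transition matrix
    r = np.full(n, 1.0 / n)                # uniform initial scores
    for _ in range(max_iter):
        r_new = (1 - damping) / n + damping * (P.T @ r)
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

# Star graph: the hub (node 0) should receive the highest score.
adj = np.array([[0, 1, 1, 1],
                [1, 0, 0, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 0]], dtype=float)
r = pagerank(adj)
print(r.argmax())  # 0
```

A pooling layer could then keep the top-k nodes by score, or use the scores to weight cluster assignments; either choice is a design decision beyond this sketch.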

Learning representations that can disentangle explanatory attributes underlying the data improves interpretability as well as provides control over data generation. Various learning frameworks such as VAEs, GANs and auto-encoders have been used in the literature to learn such representations. Most often, the latent space is constrained to a partitioned representation or structured by a prior to impose disentangling. In this work, we advance the use of a latent representation based on a product space of Orthogonal Spheres (PrOSe). The PrOSe model is motivated by the reasoning that latent variables related to the physics of image formation can, under certain relaxed assumptions, lead to spherical spaces. Orthogonality between the spheres is motivated via physical independence models. Imposing the orthogonal-sphere constraint is much simpler than other complicated physical models, is fairly general and flexible, and is extensible beyond the factors used to motivate its development. Under further relaxed assumptions of equal-sized latent blocks per factor, the constraint can be written down in closed form as an ortho-normality term in the loss function. We show that our approach improves the quality of disentanglement significantly. We find consistent improvement in disentanglement compared to several state-of-the-art approaches, across several benchmarks and metrics.]]>
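A closed-form ortho-normality term of the kind mentioned above is typically a Frobenius-norm penalty of the form ||Z^T Z - I||_F^2, which vanishes exactly when the latent blocks are orthonormal. The sketch below shows that penalty in isolation; the shapes and usage are illustrative assumptions, not the PrOSe training objective:

```python
import numpy as np

def ortho_penalty(Z):
    """Frobenius penalty ||Z^T Z - I||_F^2, zero iff columns are orthonormal."""
    k = Z.shape[1]
    gram = Z.T @ Z
    return float(np.sum((gram - np.eye(k)) ** 2))

# An orthonormal basis incurs (near-)zero penalty; a degenerate one does not.
Q, _ = np.linalg.qr(np.random.default_rng(2).normal(size=(16, 4)))
print(ortho_penalty(Q) < 1e-12)            # True: QR columns are orthonormal
print(ortho_penalty(np.ones((16, 4))) > 1)  # True: identical columns
```

In training, such a term would simply be added to the reconstruction loss with a weighting coefficient, which is what makes the constraint cheap compared to explicit physical models.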

Objective: Epilepsy is a chronic neurological disorder characterized by the occurrence of spontaneous seizures, which affects about one percent of the world's population. Most of the current seizure detection approaches strongly rely on patient history records and thus fail in the patient-independent setting of detecting seizures in new patients. To overcome this limitation, we propose a robust and explainable epileptic seizure detection model that effectively learns from seizure states while eliminating inter-patient noise. Methods: A complex deep neural network model is proposed to learn the pure seizure-specific representation from raw non-invasive electroencephalography (EEG) signals through adversarial training. Furthermore, to enhance explainability, we develop an attention mechanism to automatically learn the importance of each EEG channel in the seizure diagnosis procedure. Results: The proposed approach is evaluated on the Temple University Hospital EEG (TUH EEG) database. The experimental results illustrate that our model outperforms competitive state-of-the-art baselines with low latency. Moreover, the designed attention mechanism is demonstrated to be able to provide fine-grained information for pathological analysis. Conclusion and significance: We propose an effective and efficient patient-independent diagnosis approach for epileptic seizures based on raw EEG signals without manual feature engineering, which is a step toward the development of large-scale deployment for real-life use.]]>

Learning distributed representations for nodes in graphs is a crucial primitive in network analysis with a wide spectrum of applications. Linear graph embedding methods learn such representations by optimizing the likelihood of both positive and negative edges while constraining the dimension of the embedding vectors. We argue that the generalization performance of these methods is not due to the dimensionality constraint as commonly believed, but rather the small norm of the embedding vectors. Both theoretical and empirical evidence are provided to support this argument: (a) we prove that the generalization error of these methods can be bounded by limiting the norm of the vectors, regardless of the embedding dimension; (b) we show that the generalization performance of linear graph embedding methods is correlated with the norm of the embedding vectors, which is small due to the early stopping of SGD and the vanishing gradients. We performed extensive experiments to validate our analysis and showcased the importance of proper norm regularization in practice.]]>
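The kind of objective analyzed above is an edge-likelihood loss plus an explicit norm penalty on the embedding matrix. A toy sketch of such a penalized loss (the specific logistic form and weighting are illustrative, not the paper's exact model):

```python
import numpy as np

def edge_loss(U, edges, labels, lam):
    """Logistic loss over (i, j) node pairs plus an explicit norm penalty."""
    scores = np.array([U[i] @ U[j] for i, j in edges])
    probs = 1.0 / (1.0 + np.exp(-scores))        # edge probabilities
    eps = 1e-12                                  # numerical floor for log
    nll = -np.mean(labels * np.log(probs + eps)
                   + (1 - labels) * np.log(1 - probs + eps))
    return nll + lam * np.sum(U ** 2)            # norm regularization term

rng = np.random.default_rng(3)
U = rng.normal(size=(5, 3))                      # 5 nodes, 3-dim embeddings
edges = [(0, 1), (1, 2), (3, 4)]                 # positive and negative pairs
labels = np.array([1.0, 0.0, 1.0])
# A larger lam strictly increases the penalized loss for fixed embeddings.
print(edge_loss(U, edges, labels, 0.1) > edge_loss(U, edges, labels, 0.0))
```

The paper's point is that the `lam * np.sum(U ** 2)` term, rather than the embedding dimension, is what controls generalization, and that SGD with early stopping implicitly keeps this norm small even when `lam` is zero.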

We study feature propagation on graphs, an inference process involved in graph representation learning tasks. It spreads features over the whole graph up to the $t$-th order, thereby expanding each node's features. The process has been successfully adopted in graph embedding and graph neural networks; however, few works have studied the convergence of feature propagation. Without convergence guarantees, it may lead to unexpected numerical overflows and task failures. In this paper, we first define the concept of feature propagation on graphs formally, and then study its convergence conditions to equilibrium states. We further link feature propagation to several established approaches such as node2vec and structure2vec. At the end of this paper, we extend existing approaches from representing nodes to representing edges (edge2vec) and demonstrate its application to fraud transaction detection in a real-world scenario. Experiments show that it is quite competitive.]]>
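A standard convergent instance of feature propagation is the damped iteration H <- alpha * A_hat @ H + (1 - alpha) * X with a symmetrically normalized adjacency A_hat: since A_hat has spectral radius at most 1, any alpha < 1 makes the map a contraction with fixed point (I - alpha * A_hat)^(-1) (1 - alpha) X. This sketch illustrates that convergence condition, not the paper's specific formulation:

```python
import numpy as np

def propagate(adj, X, alpha=0.5, iters=200):
    """Iterate H <- alpha * A_hat @ H + (1 - alpha) * X to a fixed point."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    A_hat = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # sym-normalized
    H = X.copy()
    for _ in range(iters):
        H = alpha * (A_hat @ H) + (1 - alpha) * X
    return H, A_hat

# Path-like toy graph; initial features are one-hot per node.
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]], dtype=float)
X = np.eye(3)
H, A_hat = propagate(adj, X, alpha=0.5)
# The iteration converges to the closed-form equilibrium state.
fixed = np.linalg.solve(np.eye(3) - 0.5 * A_hat, 0.5 * X)
print(np.allclose(H, fixed))  # True
```

With alpha >= 1 (or an unnormalized adjacency) the same loop can diverge, which is exactly the numerical-overflow failure mode the abstract warns about.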

Network alignment is a critical task in a wide variety of fields. Many existing works leverage representation learning to accomplish this task without eliminating the domain representation bias induced by domain-dependent features, which yields inferior alignment performance. This paper proposes a unified deep architecture (DANA) to obtain a domain-invariant representation for network alignment via an adversarial domain classifier. Specifically, we employ graph convolutional networks to perform network embedding under the domain adversarial principle, given a small set of observed anchors. Then, the semi-supervised learning framework is optimized by maximizing a posterior probability distribution of observed anchors and the loss of a domain classifier simultaneously. We also develop a few variants of our model, such as direction-aware network alignment, weight-sharing for directed networks, and simplification of the parameter space. Experiments on three real-world social network datasets demonstrate that our proposed approaches achieve state-of-the-art alignment results.]]>

Dashboard cameras capture a tremendous amount of driving scene video each day. These videos are purposefully coupled with vehicle sensing data, such as from the speedometer and inertial sensors, providing an additional sensing modality for free. In this work, we leverage the large-scale unlabeled yet naturally paired data for visual representation learning in the driving scenario. A representation is learned in an end-to-end self-supervised framework for predicting dense optical flow from a single frame with paired sensing data. We postulate that success on this task requires the network to learn semantic and geometric knowledge in the ego-centric view. For example, forecasting a future view to be seen from a moving vehicle requires an understanding of scene depth, scale, and movement of objects. We demonstrate that our learned representation can benefit other tasks that require detailed scene understanding and outperforms competing unsupervised representations on semantic segmentation.]]>

Attributed network embedding has received much interest from the research community, as most networks come with some content in each node, also known as node attributes. Existing attributed network approaches work well when the network is consistent in structure and attributes, and nodes behave as expected. But real-world networks often have anomalous nodes. Typically these outliers, being relatively unexplainable, affect the embeddings of other nodes in the network. Thus all the downstream network mining tasks fail miserably in the presence of such outliers. Hence an integrated approach to detect anomalies and reduce their overall effect on the network embedding is required. Towards this end, we propose an unsupervised outlier aware network embedding algorithm (ONE) for attributed networks, which minimizes the effect of the outlier nodes and hence generates robust network embeddings. We align and jointly optimize the loss functions coming from the structure and attributes of the network. To the best of our knowledge, this is the first generic network embedding approach which incorporates the effect of outliers for an attributed network without any supervision. We experimented on publicly available real networks and manually planted different types of outliers to check the performance of the proposed algorithm. Results demonstrate the superiority of our approach in detecting network outliers compared to the state-of-the-art approaches. We also consider different downstream machine learning applications on networks to show the efficiency of ONE as a generic network embedding technique. The source code is made available at this https URL.]]>

Most of the existing medicine recommendation systems, which are mainly based on electronic medical records (EMRs), significantly assist doctors in making better clinical decisions, benefiting both patients and caregivers. Even though the growth of EMRs is at a lightning-fast speed in the era of big data, content limitations in EMRs restrain existing recommendation systems from reflecting relevant medical facts, such as drug-drug interactions. Many medical knowledge graphs that contain drug-related information, such as DrugBank, may give hope to the recommendation systems. However, the direct use of these knowledge graphs in the systems suffers from robustness issues caused by the incompleteness of the graphs. To address these challenges, we build on recent advances in graph embedding learning techniques and propose a novel framework, called Safe Medicine Recommendation (SMR), in this paper. Specifically, SMR first constructs a high-quality heterogeneous graph by bridging EMRs (MIMIC-III) and medical knowledge graphs (ICD-9 ontology and DrugBank). Then, SMR jointly embeds diseases, medicines, patients, and their corresponding relations into a shared lower-dimensional space. Finally, SMR uses the embeddings to decompose the medicine recommendation into a link prediction process while considering the patient's diagnoses and adverse drug reactions. To the best of our knowledge, SMR is the first to learn embeddings of a patient-disease-medicine graph for medicine recommendation. Extensive experiments on real datasets are conducted to evaluate the effectiveness of the proposed framework.]]>

The estimation of an f-divergence between two probability distributions based on samples is a fundamental problem in statistics and machine learning. Most works study this problem under very weak assumptions, in which case it is provably hard. We consider the case of stronger structural assumptions that are commonly satisfied in modern machine learning, including representation learning and generative modelling with autoencoder architectures. Under these assumptions we propose and study an estimator that can be easily implemented, works well in high dimensions, and enjoys faster rates of convergence. We verify the behavior of our estimator empirically in both synthetic and real-data experiments, and discuss its direct implications for total correlation, entropy, and mutual information estimation.]]>

Learning neural program embeddings is key to utilizing deep neural networks in programming languages research --- precise and efficient program representations enable the application of deep models to a wide range of program analysis tasks. Existing approaches predominantly learn to embed programs from their source code, and, as a result, they do not capture deep, precise program semantics. On the other hand, models learned from runtime information critically depend on the quality of program executions, thus leading to trained models with highly variant quality. This paper tackles these inherent weaknesses of prior approaches by introducing a new deep neural network, \liger, which learns program representations from a mixture of symbolic and concrete execution traces. We have evaluated \liger on \coset, a recently proposed benchmark suite for evaluating neural program embeddings. Results show \liger (1) is significantly more accurate than the state-of-the-art syntax-based models Gated Graph Neural Network and code2vec in classifying program semantics, and (2) requires on average 10x fewer executions covering 74\% fewer paths than the state-of-the-art dynamic model \dypro. Furthermore, we extend \liger to predict the name of a method from its body's vector representation. Learning on the same set of functions (more than 170K in total), \liger significantly outperforms code2seq, the previous state-of-the-art for method name prediction.]]>

This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and segmentation) compared to existing methods. Our method extends the BERT model for text sequences to the case of sequences of real-valued feature vectors, by replacing the softmax loss with noise contrastive estimation (NCE). We also show how to learn representations from sequences of visual features and sequences of words derived from ASR (automatic speech recognition), and show that such cross-modal training (when possible) helps even more.]]>

We introduce a parameterization method called Neural Bayes which allows computing statistical quantities that are in general difficult to compute and opens avenues for formulating new objectives for unsupervised representation learning. Specifically, given an observed random variable $\mathbf{x}$ and a latent discrete variable $z$, we can express $p(\mathbf{x}|z)$, $p(z|\mathbf{x})$ and $p(z)$ in closed form in terms of a sufficiently expressive function (e.g., a neural network) using our parameterization without restricting the class of these distributions. To demonstrate its usefulness, we develop two independent use cases for this parameterization: 1. Mutual Information Maximization (MIM): MIM has become a popular means for self-supervised representation learning. Neural Bayes allows us to compute mutual information between observed random variables $\mathbf{x}$ and latent discrete random variables $z$ in closed form. We use this for learning image representations and show its usefulness on downstream classification tasks. 2. Disjoint Manifold Labeling: Neural Bayes allows us to formulate an objective which can optimally label samples from disjoint manifolds present in the support of a continuous distribution. This can be seen as a specific form of clustering where each disjoint manifold in the support is a separate cluster. We design clustering tasks that obey this formulation and empirically show that the model optimally labels the disjoint manifolds. Our code is available at \url{this https URL}]]>

Graphs are a natural abstraction for many problems where nodes represent entities and edges represent a relationship across entities. An important area of research that has emerged over the last decade is the use of graphs as a vehicle for non-linear dimensionality reduction, in a manner akin to previous efforts based on manifold learning, with uses for downstream database processing, machine learning and visualization. In this systematic yet comprehensive experimental survey, we benchmark several popular network representation learning methods operating on two key tasks: link prediction and node classification. We examine the performance of 12 unsupervised embedding methods on 15 datasets. To the best of our knowledge, the scale of our study -- both in terms of the number of methods and number of datasets -- is the largest to date. Our results reveal several key insights about work to date in this space. First, we find that certain baseline methods (task-specific heuristics, as well as classic manifold methods) that have often been dismissed or not considered by previous efforts can compete on certain types of datasets if they are tuned appropriately. Second, we find that recent methods based on matrix factorization offer a small but relatively consistent advantage over alternative methods (e.g., random-walk based methods) from a qualitative standpoint. Specifically, we find that MNMF, a community-preserving embedding method, is the most competitive method for the link prediction task, while NetMF is the most competitive baseline for node classification. Third, no single method completely outperforms other embedding methods on both node classification and link prediction tasks. We also present several drill-down analyses that reveal settings under which certain algorithms perform well (e.g., the role of neighborhood context on performance) -- guiding the end-user.]]>

Existing works on disentangled representation learning usually rest on a common assumption: all factors in disentangled representations should be independent. This assumption concerns the inner properties of disentangled representations, while ignoring their relation with external data. To tackle this problem, we propose another assumption to establish an important relation between data and its disentangled representations via mutual information: the mutual information between each factor of disentangled representations and data should be invariant to other factors. We formulate this assumption into mathematical equations, and theoretically bridge it with independence and conditional independence of factors. Meanwhile, we show that conditional independence is satisfied in encoders of VAEs due to factorized noise in reparameterization. To highlight the importance of our proposed assumption, we show in experiments that violating the assumption leads to a dramatic decline of disentanglement. Based on this assumption, we further propose to split the deeper layers in the encoder to ensure parameters in these layers are not shared across different factors. The proposed encoder, called Split Encoder, can be applied to models that penalize total correlation, and shows significant improvement in unsupervised learning of disentangled representations and reconstructions.]]>

We present an unsupervised representation learning approach that compactly encodes the motion dependencies in videos. Given a pair of images from a video clip, our framework learns to predict the long-term 3D motions. To reduce the complexity of the learning framework, we propose to describe the motion as a sequence of atomic 3D flows computed with RGB-D modality. We use a Recurrent Neural Network based Encoder-Decoder framework to predict these sequences of flows. We argue that in order for the decoder to reconstruct these sequences, the encoder must learn a robust video representation that captures long-term motion dependencies and spatial-temporal relations. We demonstrate the effectiveness of our learned temporal representations on activity classification across multiple modalities and datasets such as NTU RGB+D and MSR Daily Activity 3D. Our framework is generic to any input modality, i.e., RGB, Depth, and RGB-D videos.]]>