摘要 :
Memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. In the past, the memory-based approach has been shown to suffer from two fundamental probl...
展开
Memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. In the past, the memory-based approach has been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. Alternatively, the model-based approach has been proposed to alleviate these problems, but this approach tends to limit the range of users. In this paper, we present a novel approach that combines the advantages of these two approaches by introducing a smoothing-based method. In our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. As a result, we provide higher accuracy as well as increased efficiency in recommendations. Empirical studies on two datasets (EachMovie and MovieLens) show that our new proposed approach consistently outperforms other state-of-art collaborative filtering algorithms.
收起
摘要 :
Memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. In the past, the memory-based approach has been shown to suffer from two fundamental probl...
展开
Memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. In the past, the memory-based approach has been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. Alternatively, the model-based approach has been proposed to alleviate these problems, but this approach tends to limit the range of users. In this paper, we present a novel approach that combines the advantages of these two approaches by introducing a smoothing-based method. In our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. As a result, we provide higher accuracy as well as increased efficiency in recommendations. Empirical studies on two datasets (EachMovie and MovieLens) show that our new proposed approach consistently outperforms other state-of-art collaborative filtering algorithms.
收起
摘要 :
Simulation of the real-world traffic can be used to help validate the transportation policies. A good simulator means the simulated traffic is similar to real-world traffic, which often requires dense traffic trajectories (i.e., w...
展开
Simulation of the real-world traffic can be used to help validate the transportation policies. A good simulator means the simulated traffic is similar to real-world traffic, which often requires dense traffic trajectories (i.e., with high sampling rate) to cover dynamic situations in the real world. However, in most cases, the real-world trajectories are sparse, which makes simulation challenging. In this paper, we present a novel framework ImIn-GAIL to address the problem of learning to simulate the driving behavior from sparse real-world data. The proposed architecture incorporates data interpolation with the behavior learning process of imitation learning. To the best of our knowledge, we are the first to tackle the data sparsity issue for behavior learning problems. We investigate our framework on both synthetic and real-world trajectory datasets of driving vehicles, showing that our method outperforms various baselines and state-of-the-art methods.
收起
摘要 :
Simulation of the real-world traffic can be used to help validate the transportation policies. A good simulator means the simulated traffic is similar to real-world traffic, which often requires dense traffic trajectories (i.e., w...
展开
Simulation of the real-world traffic can be used to help validate the transportation policies. A good simulator means the simulated traffic is similar to real-world traffic, which often requires dense traffic trajectories (i.e., with high sampling rate) to cover dynamic situations in the real world. However, in most cases, the real-world trajectories are sparse, which makes simulation challenging. In this paper, we present a novel framework ImIn-GAIL to address the problem of learning to simulate the driving behavior from sparse real-world data. The proposed architecture incorporates data interpolation with the behavior learning process of imitation learning. To the best of our knowledge, we are the first to tackle the data sparsity issue for behavior learning problems. We investigate our framework on both synthetic and real-world trajectory datasets of driving vehicles, showing that our method outperforms various baselines and state-of-the-art methods.
收起
摘要 :
Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweets data. A popular procedure to reduce the noise of textual data is to remove stopwords by using pre-compiled st...
展开
Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweets data. A popular procedure to reduce the noise of textual data is to remove stopwords by using pre-compiled stopword lists or more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in the last few years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations on the level of data sparsity, the size of the classifier's feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, the dynamic generation of stopword lists, by removing those infrequent terms appearing only once in the corpus, appears to be the optimal method to maintaining a high classification performance while reducing the data sparsity and substantially shrinking the feature space.
收起
摘要 :
Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweets data. A popular procedure to reduce the noise of textual data is to remove stopwords by using pre-compiled st...
展开
Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweets data. A popular procedure to reduce the noise of textual data is to remove stopwords by using pre-compiled stopword lists or more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in the last few years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations on the level of data sparsity, the size of the classifier's feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, the dynamic generation of stopword lists, by removing those infrequent terms appearing only once in the corpus, appears to be the optimal method to maintaining a high classification performance while reducing the data sparsity and substantially shrinking the feature space.
收起
摘要 :
Electronic Medical Records (EHR) are extremely sparse. Only a small proportion of events (symptoms, diagnoses, and treatments) are observed in the lifetime of an individual. The high degree of missingness of EHR can be attributed ...
展开
Electronic Medical Records (EHR) are extremely sparse. Only a small proportion of events (symptoms, diagnoses, and treatments) are observed in the lifetime of an individual. The high degree of missingness of EHR can be attributed to a large number of factors, including device failure, privacy concerns, or other unexpected reasons. Unfortunately, many traditional imputation methods are not well suited for highly sparse data and scale poorly to high dimensional datasets. In this paper, we propose a graph-based imputation method that is both robust to sparsity and to unreliable unmeasured events. Our approach compares favourably to several standard and state-of-the-art imputation methods in terms of performance and runtime. Moreover, results indicate that the model learns to embed different event types in a clinically meaningful way. Our work can facilitate the diagnosis of novel diseases based on the clinical history of past events, with the potential to increase our understanding of the landscape of comorbidities.
收起
摘要 :
A timely and accurate prediction of the vehicle's trajectory can render a good service, help predict traffic flow, and in the more interesting case, detect potentially dangerous situations as early as possible. It is widespread to...
展开
A timely and accurate prediction of the vehicle's trajectory can render a good service, help predict traffic flow, and in the more interesting case, detect potentially dangerous situations as early as possible. It is widespread to use historical data to predict trajectories. However, due to the sparsity of trajectory, sometimes it is difficult to predicate trajectory. A novel approach to trajectory prediction is proposed which has the capability to predict the trajectory with sparse data. To achieve this, we denote city route and trajectory as a series of fixed points. Under normal conditions, only the most similarly trajectory is used to predict trajectories, which may mislead the result. To remedy this problem, we use several most similarity trajectories to predict the next point, and regard the most points as our result. The experimental results show that the approach improves the prediction accuracy 8%, coverage 2%, and greatly outperform the baseline algorithm (30 ms) compared with Spatial Iterative Grid Partition (SIGP).
收起
摘要 :
In this paper, we propose some sparsity aware algorithms, namely the Recursive least-Squares for sparse systems (S-RLS) and l0-norm Recursive least-Squares (l0-RLS), in order to exploit the sparsity of an unknown system. The first...
展开
In this paper, we propose some sparsity aware algorithms, namely the Recursive least-Squares for sparse systems (S-RLS) and l0-norm Recursive least-Squares (l0-RLS), in order to exploit the sparsity of an unknown system. The first algorithm, applies a discard function on the weight vector to disregard the coefficients close to zero during the update process. The second algorithm, employs the sparsity-promoting scheme via some non-convex approximations to the l0-norm. In addition, we consider the respective versions of these algorithms in data-selective versions in order to reduce the update rate. Simulation results show similar performance when comparing the proposed algorithms with standard Recursive Least-Squares (RLS) algorithm while the proposed algorithms require lower computational complexity.
收起
摘要 :
This paper proposes a collaborative filtering recommendation algorithm based on multi-dimensional data filling aiming at data sparsity of collaborative filtering. In order to find the dimensions which not only make data collecting...
展开
This paper proposes a collaborative filtering recommendation algorithm based on multi-dimensional data filling aiming at data sparsity of collaborative filtering. In order to find the dimensions which not only make data collecting more convenient, but also distinguish the user's interests prominently, this algorithm evaluates the user's dimensions. Then user-item rating matrix is filled with multidimensional data in the basis of the chosen dimensions. In this way, the degree of data sparsity of rating data is decreased and the similarity judgment is provided with more data .The results obtained by contrast experiment present that the algorithm proposed in this paper has higher efficiency of recommendation than traditional algorithm.
收起