Abstract:
The stylistic properties of text have intrigued computational linguistics researchers in recent years. Specifically, researchers have investigated the text style transfer (TST) task, which aims to change the stylistic properties of a text while retaining its style-independent content. Over the last few years, many novel TST algorithms have been developed, while the industry has leveraged these algorithms to enable exciting TST applications. The field of TST research has developed because of this symbiosis. This article aims to provide a comprehensive review of recent research efforts on text style transfer. More concretely, we create a taxonomy to organize the TST models and provide a comprehensive summary of the state of the art. We review existing evaluation methodologies for TST tasks and conduct a large-scale reproducibility study in which we experimentally benchmark 19 state-of-the-art TST algorithms on two publicly available datasets. Finally, we expand on current trends and provide new perspectives on recent developments in the TST field.
Abstract:
From a global perspective, the landscape of online social networks is highly fragmented. A large number of online social networks have appeared, providing users with various types of services. The information available in these networks generally falls into diverse categories, so the networks can be represented formally as heterogeneous social networks (HSNs). Meanwhile, in the age of online social media, users usually participate in multiple online social networks simultaneously, and these users can act as anchors that align different social networks. Multiple HSNs therefore not only represent the information in each social network, but also fuse information from multiple networks. Formally, online social networks that share common users are called aligned social networks, and the shared users are called anchor users. The heterogeneous information generated by users' social activities in multiple aligned social networks gives social network practitioners and researchers the opportunity to study an individual user's social behaviors across multiple social platforms simultaneously. This paper presents a comprehensive survey of the latest research on multiple aligned HSNs in the broad learning setting, covering five major research tasks: network alignment, link prediction, community detection, information diffusion, and network embedding.
Abstract:
Ensemble analysis has recently been studied in the context of the outlier detection problem. In this paper, we investigate the theoretical underpinnings of outlier ensemble analysis. In spite of the significant differences between the classification and the outlier analysis problems, we show that the theoretical underpinnings of the two problems are actually quite similar in terms of the bias-variance trade-off. We explain the existing algorithms within this traditional framework, and clarify misconceptions about the reasoning underpinning these methods. We propose more effective variants of subsampling and feature bagging. We also discuss the impact of the combination function, in particular the specific trade-offs of the average and maximization functions. We use these insights to propose new combination functions that are robust in many settings.
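As a rough illustration of the two combination functions named in this abstract, the sketch below averages or maximizes per-point outlier scores across several detectors. It is a minimal sketch under stated assumptions, not the paper's method: the function names `standardize` and `combine` are hypothetical, and the z-score standardization step is an assumption introduced here so that scores from different detectors are on a comparable scale.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Convert one detector's raw scores to z-scores (assumed preprocessing)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # A constant detector carries no ranking information.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def combine(score_lists, how="average"):
    """Combine per-point outlier scores from several detectors.

    score_lists: one list of scores per detector, all of equal length.
    how: "average" or "maximum", the two combination functions discussed above.
    """
    normalized = [standardize(scores) for scores in score_lists]
    # Regroup so that each tuple holds all detectors' scores for one data point.
    per_point = list(zip(*normalized))
    if how == "average":
        return [mean(point) for point in per_point]
    if how == "maximum":
        return [max(point) for point in per_point]
    raise ValueError(f"unknown combination function: {how}")


if __name__ == "__main__":
    # Hypothetical scores from two detectors on four data points.
    detector_a = [0.0, 0.0, 0.0, 4.0]
    detector_b = [0.0, 3.0, 0.0, 4.0]
    print(combine([detector_a, detector_b], "average"))
    print(combine([detector_a, detector_b], "maximum"))
```

For any data point, the maximum of the standardized scores is at least their average, so maximization emphasizes a point that any single detector flags strongly, while averaging smooths out disagreement between detectors.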
Abstract:
Traditional collaborative filtering assumes the availability of explicit ratings of users for items. However, in many cases these ratings are not available and only binary, positive-only data is available. Binary, positive-only data is typically associated with implicit feedback such as items bought, videos watched, or ads clicked on. However, it can also be the result of explicit feedback such as likes on social networking sites. Because binary, positive-only data contains no negative information, it needs to be treated differently from rating data. As a result of the growing relevance of this problem setting, the number of publications in this field is increasing rapidly. In this survey, we provide an overview of the existing work from an innovative perspective that allows us to emphasize surprising commonalities and key differences.
Abstract:
More and more applications require early decisions, i.e., decisions taken as soon as possible from partially observed data. However, the later a decision is made, the more its accuracy tends to improve, since the description of the problem at hand is enriched over time. Such a compromise between the earliness and the accuracy of decisions has been particularly studied in the field of Early Time Series Classification. This paper introduces a more general problem, called Machine Learning based Early Decision Making (ML-EDM), which consists in optimizing the decision times of models in a wide range of settings where data is collected over time. After defining the ML-EDM problem, ten challenges are identified and proposed to the scientific community to further research in this area. These challenges open important application perspectives, discussed in this paper.
Abstract:
Nowadays, data is created by humans as well as automatically collected by physical things, which embed electronics, software, sensors and network connectivity. Together, these entities constitute the Internet of Things (IoT). The automated analysis of its data can provide insights into previously unknown relationships between things, their environment and their users, facilitating an optimization of their behavior. In particular, the real-time analysis of data, embedded into physical systems, can enable new forms of autonomous control. These in turn may lead to more sustainable applications, reducing waste and saving resources. IoT's distributed and dynamic nature, the resource constraints of sensors and embedded devices, and the amounts of generated data challenge even the most advanced automated data analysis methods known today. In particular, the IoT requires a new generation of distributed analysis methods. Many existing surveys have strongly focused on the centralization of data in the cloud and big data analysis, which follows the paradigm of parallel high-performance computing. However, bandwidth and energy can be too limited for the transmission of raw data, or transmission may be prohibited due to privacy constraints. Such communication-constrained scenarios require decentralized analysis algorithms which at least partly work directly on the generating devices. After listing data-driven IoT applications, we highlight, in contrast to existing surveys, the differences between cloud-based and decentralized analysis from an algorithmic perspective. We present the opportunities and challenges of research on communication-efficient decentralized analysis algorithms. Here, the focus is on the difficult scenario of vertically partitioned data, which covers common IoT use cases. The comprehensive bibliography aims at providing readers with a good starting point for their own work.
Abstract:
Human mobility is a multidisciplinary field spanning physics and computer science and has drawn considerable attention in recent years. Some representative models and prediction approaches have been proposed for modeling and predicting human mobility. However, multi-source heterogeneous data from handheld terminals, GPS, and social media provides a new driving force for exploring urban human mobility patterns from a quantitative and microscopic perspective. Studies of human mobility modeling and prediction play a vital role in a series of applications such as urban planning, epidemic control, location-based services, and intelligent transportation management. In this survey, we review human mobility models from a human-centric angle in a data-driven context. Specifically, we characterize human mobility patterns at the individual, collective, and hybrid levels. We also survey human mobility prediction methods from four aspects and describe recent developments in each. Finally, we discuss some open issues that provide a helpful reference for researchers' future directions. This review not only lays a solid foundation for beginners who want to acquire a quick understanding of human mobility but also provides helpful information for researchers on how to develop a unified human mobility model.
Abstract:
Every day, an enormous amount of text data is produced. Sources of text data include news, social media, emails, text messages, medical reports, scientific publications and fiction. To keep track of this data, categories, keywords, tags or labels are assigned to each text. Automatically predicting such labels is the task of multi-label text classification. Often, however, we are interested in more than just the pure classification: we would like to understand which parts of a text belong to the label, which words are important for the label, or which labels occur together. Because of this, topic models may be used for multi-label classification as an interpretable model that is flexible and easily extensible. This survey demonstrates the manifold possibilities and flexibility of the topic model framework for the complex setting of multi-label text classification by categorizing different variants of models.
Abstract:
Patent documents are important intellectual resources for protecting the interests of individuals, organizations and companies. Unlike general web documents, patent documents have a well-defined format including front page, description, claims, and figures. However, they are lengthy and rich in technical terms, so analyzing them requires enormous human effort. Hence, a new research area, called patent mining, has emerged in recent years, aiming to assist patent analysts in investigating, processing, and analyzing patent documents. Despite the recent advances in patent mining, it is still far from being well explored in research communities. To help patent analysts and interested readers obtain a big picture of patent mining, we provide a systematic summary of existing research efforts along this direction. In this survey, we first present an overview of the technical trend in patent mining. We then investigate multiple research questions related to patent documents, including patent retrieval, patent classification, and patent visualization, and provide summaries and highlights for each question by delving into the corresponding research efforts.
Abstract:
The Web is a steadily evolving resource comprising much more than mere HTML pages. With its ever-growing data sources in a variety of formats, it provides great potential for knowledge discovery. In this article, we shed light on some interesting phenomena of the Web: the deep Web, which surfaces database records as Web pages; the Semantic Web, which defines meaningful data exchange formats; XML, which has established itself as a lingua franca for Web data exchange; and domain-specific markup languages, which are designed based on XML syntax with the goal of preserving semantics in targeted domains. We detail these four developments in Web technology, and explain how they can be used for data mining. Our goal is to show that all these areas can be as useful for knowledge discovery as the HTML-based part of the Web.