Abstract:
The 21st century has ushered in the age of big data and data economy, in which data DNA, which carries important knowledge, insights, and potential, has become an intrinsic constituent of all data-based organisms. An appropriate understanding of data DNA and its organisms relies on the new field of data science and its keystone, analytics. Although it is widely debated whether big data is only hype and buzz, and data science is still in a very early phase, significant challenges and opportunities are emerging or have been inspired by the research, innovation, business, profession, and education of data science. This article provides a comprehensive survey and tutorial of the fundamental aspects of data science: the evolution from data analysis to data science, the data science concepts, a big picture of the era of data science, the major challenges and directions in data innovation, the nature of data analytics, new industrialization and service opportunities in the data economy, the profession and competency of data education, and the future of data science. This article is the first in the field to draw a comprehensive big picture, in addition to offering rich observations, lessons, and thinking about data science and analytics.
Abstract:
The development of the Internet in recent years has made it possible and useful to access many different information systems anywhere in the world to obtain information. While there is much research on the integration of heterogeneous information systems, most commercial systems stop short of the actual integration of available data. Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation.
This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion, namely, uncertain and conflicting data values. We give an overview and classification of different ways of fusing data and present several techniques based on standard and advanced operators of the relational algebra and SQL. Finally, the article features a comprehensive survey of data integration systems from academia and industry, showing if and how data fusion is performed in each.
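The idea of fusing several records for one real-world object into a single, complete, and consistent record can be illustrated with a minimal sketch. It is not the article's relational-algebra or SQL technique: the function name `fuse_records`, the sample records, and the choice of majority voting as the conflict-resolution strategy are all assumptions made for illustration.

```python
from collections import Counter


def fuse_records(records, key):
    """Fuse records that share the same key into one record per object.

    Hypothetical helper: missing values (None) are filled from other
    sources, and conflicting values are resolved by majority vote.
    """
    groups = {}
    for rec in records:
        groups.setdefault(rec[key], []).append(rec)

    fused = []
    for key_value, group in groups.items():
        merged = {key: key_value}
        attributes = {attr for rec in group for attr in rec if attr != key}
        for attr in sorted(attributes):
            # Drop missing values so a gap in one source can be filled by another.
            values = [rec[attr] for rec in group if rec.get(attr) is not None]
            if not values:
                merged[attr] = None  # uncertain: no source provides a value
            else:
                # Conflict: keep the most frequent value (first seen wins ties).
                merged[attr] = Counter(values).most_common(1)[0][0]
        fused.append(merged)
    return fused


# Made-up sample data: three sources describe object 1, one describes object 2.
records = [
    {"id": 1, "name": "Ann Lee", "city": "Berlin"},
    {"id": 1, "name": "Ann Lee", "city": None},
    {"id": 1, "name": "A. Lee", "city": "Berlin"},
    {"id": 2, "name": "Bob Ray", "city": None},
]
fused = fuse_records(records, "id")
for row in fused:
    print(row)
```

Majority voting is only one possible resolution strategy; preferring the most recent or most trusted source would slot into the same place in the loop.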
Abstract:
Very large amounts of geospatial data are generated daily by many observation processes in different application domains. The amount of data produced is increasing due to advances in the use of modern automatic sensing devices and in the facilities available to promote crowdsourced data collection initiatives. Spatial observation data include both data on conventional entities and samplings over multi-dimensional spaces. Existing observation data management solutions lack declarative specification of spatio-temporal analytics. On the other hand, current data management technologies miss observation data semantics and fail to integrate the management of entities and samplings in a single data modeling solution. This paper presents the design of a framework that enables declarative spatio-temporal analysis over large warehouses of observation data. It integrates the management of entities and samplings within a simple data model based on the well-known mathematical concept of a function. Observation data semantics are incorporated into the model with appropriate metadata structures.
Abstract:
Data Ecosystems (DE) are used across various fields and applications. They facilitate collaboration between organizations, such as companies or research institutions, enabling them to share data and services. A DE can boost research outcomes by managing and extracting value from the increasing volume of data generated and shared in recent decades. However, the adoption of DE solutions for scientific data by R&D departments and scientific communities is still difficult. Scientific data are challenging to manage, and, as a result, a considerable part of this information still needs to be annotated and organized in order to be shared. This work discusses the challenges of employing DE in scientific domains and the corresponding potential mitigations. First, scientific data and their typologies are contextualized, then their unique characteristics are discussed. Their typically high heterogeneity and uncertainty make assessing their consistency and accuracy problematic. In addition, this work discusses the specific requirements expressed by the scientific communities when it comes to integrating a DE solution into their workflow. The unique properties of scientific data and domain-specific requirements create a challenging setting for adopting DEs. The challenges are expressed as general research questions, and this work explores the corresponding solutions in terms of data management aspects. Finally, the paper presents a real-world scenario with more technical details.
Abstract:
We present the results of a phytosociological survey of the primary forests of different areas in the NE of Rio Grande do Sul (Brazil). The study combines the Braun-Blanquet methodology for collecting phytosociological data with a numerical classification technique (average linkage clustering based on Euclidean distance). Five associations, five sub-associations, three alliances, two orders, and one phytosociological class are proposed on the basis of the cluster analysis of 59 phytosociological relevés and of floristic, biogeographical, bioclimatic, edaphic, and physiognomic data from previous phytosociological studies of Brazil's forests.
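The numerical classification step named above, average linkage clustering on Euclidean distance, can be sketched in a few lines. This is a plain illustration rather than the authors' procedure or software: the function names, the stopping rule (a fixed number of clusters), and the five small "relevé" vectors of cover values are all invented for the example.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points, n_clusters):
    """Agglomerative clustering with average linkage.

    Starts with one cluster per point and repeatedly merges the two
    clusters whose mean pairwise distance is smallest, until only
    n_clusters remain. Returns clusters as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > max(n_clusters, 1):
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            dists = [
                euclidean(points[p], points[q])
                for p in clusters[i]
                for q in clusters[j]
            ]
            avg = sum(dists) / len(dists)
            if best is None or avg < best[0]:
                best = (avg, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return [sorted(c) for c in clusters]


# Hypothetical relevés: each vector holds cover values for four species.
releves = [
    (5, 4, 0, 0),
    (4, 5, 0, 1),
    (0, 0, 5, 4),
    (0, 1, 4, 5),
    (5, 5, 0, 0),
]
groups = average_linkage(releves, 2)
print(groups)
```

With these made-up values, relevés 0, 1, and 4 share their dominant species and end up in one cluster, while relevés 2 and 3 form the other.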
Abstract:
Smart meters generate a large number of electricity usage records every day. A traditional architecture might not properly handle increasingly dynamic data that require flexibility. Effective storage and analytics require an efficient architecture that can accommodate much greater data volumes and varieties. In this paper, we propose an architecture for data storage and analytics in a big data lake of electricity usage using Spark. Apache Sqoop is used to migrate historical data from an existing system to Apache Hive for processing. Apache Kafka is used as the input source for Spark, which streams data to Apache HBase to ensure the integrity of the streaming data. To integrate the data, Hive and HBase serve as the stores of the data lake, and Apache Impala and Apache Phoenix are used as the query engines for Hive and HBase, respectively. This work also analyzes electricity usage and power failures with Apache Spark. All of the visualizations in this project are presented in Apache Superset. Moreover, a usage prediction comparison is presented using the Holt-Winters algorithm.
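The abstract names Holt-Winters forecasting but does not say which variant or parameters were used. The sketch below assumes the additive variant with a simple first-season initialization; the function name, the smoothing parameters, and the tiny usage series are illustrative assumptions, not values from the paper.

```python
def holt_winters_additive(series, season_len, alpha, beta, gamma, horizon):
    """Additive Holt-Winters (triple exponential smoothing) forecast.

    Assumes len(series) >= 2 * season_len. Level, trend, and seasonal
    components are initialized from the first two seasons, then updated
    over the remaining observations.
    """
    m = season_len
    level = sum(series[:m]) / m
    # Average per-step change between the first and second season.
    trend = (sum(series[m:2 * m]) - sum(series[:m])) / (m * m)
    seasonals = [series[i] - level for i in range(m)]

    for t in range(m, len(series)):
        value = series[t]
        last_level = level
        level = alpha * (value - seasonals[t % m]) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonals[t % m] = gamma * (value - level) + (1 - gamma) * seasonals[t % m]

    n = len(series)
    return [
        level + h * trend + seasonals[(n + h - 1) % m]
        for h in range(1, horizon + 1)
    ]


# Made-up usage readings with a repeating four-step pattern and no trend.
usage = [10, 20, 30, 40, 10, 20, 30, 40]
forecast = holt_winters_additive(usage, 4, 0.5, 0.5, 0.5, 4)
print(forecast)
```

On a perfectly periodic series like this one, the forecast simply continues the pattern; real meter data would also need the smoothing parameters fitted, for example by minimizing forecast error on held-out readings.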
Abstract:
The collection, organization, and long-term preservation of resources are the raison d'etre of archives and archivists. The archival community, however, has largely neglected science data, assuming they were outside the bounds of their professional concerns. Scientists, on the other hand, increasingly recognize that they lack the skills and expertise needed to meet the demands being placed on them with regard to data curation and are seeking the help of "data archivists" and "data curators." This represents a significant opportunity for archivists and archival scholars but one that can only be realized if they better understand the scientific context.
Abstract:
Today, large corporations are constructing enterprise data warehouses from disparate data sources in order to run enterprise-wide data analysis applications, including decision support systems, multidimensional online analytical applications, data mining, and customer relationship management systems. A major problem that is only beginning to be recognized is that the data in data sources are often "dirty". Broadly, dirty data include missing data, wrong data, and non-standard representations of the same data. The results of analyzing a database or data warehouse of dirty data can be damaging and, at best, unreliable. In this paper, a comprehensive classification of dirty data is developed for use as a framework for understanding how dirty data arise, manifest themselves, and may be cleansed to ensure proper construction of data warehouses and accurate data analysis. The impact of dirty data on data mining is also explored.
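The three broad kinds of dirty data listed above (missing data, wrong data, and non-standard representations) can be made concrete with a small sketch. It does not reproduce the paper's classification; the field names, the validity rule for `age`, and the country alias table are hypothetical and exist only to show how each category might be detected.

```python
# Hypothetical lookup: lowercase spellings mapped to one canonical code.
CANONICAL_COUNTRY = {
    "us": "US",
    "usa": "US",
    "united states": "US",
    "de": "DE",
    "germany": "DE",
}


def classify_value(field, value):
    """Label a single value as 'missing', 'wrong', 'non-standard', or 'clean'."""
    if value is None or (isinstance(value, str) and value.strip() == ""):
        return "missing"
    if field == "age":
        # Assumed rule: ages are integers between 0 and 120.
        return "clean" if isinstance(value, int) and 0 <= value <= 120 else "wrong"
    if field == "country":
        text = str(value).strip()
        canonical = CANONICAL_COUNTRY.get(text.lower())
        if canonical is None:
            return "wrong"  # not a known country under any spelling
        return "clean" if text == canonical else "non-standard"
    return "clean"  # no rule defined for this field


# Made-up record with one problem of each kind.
record = {"name": "", "age": 250, "country": "Germany"}
report = {field: classify_value(field, value) for field, value in record.items()}
print(report)
```

Separating "wrong" from "non-standard" matters for cleansing: a non-standard value can be repaired mechanically by mapping it to its canonical form, whereas a wrong value needs an external source of truth.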
Abstract:
Purpose - The massive increase in research data being produced nowadays has highlighted the importance of research data management (RDM) to science. Research data not only have to be cost effective but also reliable, discoverable, accessible, and reusable. In this regard, the purpose of this paper is to investigate the perceptions and practices of Turkish researchers on the subject of RDM.
Abstract:
The burgeoning use of numeric data resources across all academic disciplines raises significant questions about the library's role in providing data services and promoting quantitative literacy. This study analyzes the content of web pages to determine the presence and promotion of data resources and services at a random sample of large research libraries (and their related institutions) in North America. The results prompt the authors to challenge libraries to more fully engage their users in efforts at richer discovery, use, and analysis of data.