Abstract:
Heterogeneous mobile, sensor, IoT, smart environment, and social networking applications have recently started to produce unbounded, fast, and massive-scale streams of data that have to be processed "on the fly". Systems that process such data have to be enhanced with detection for operational exceptions and with triggers for both automated and manual operator actions. In this paper, we illustrate how tracing in distributed data processing systems can be applied to detecting changes in data and operational environment to maintain the efficiency of heterogeneous data stream processing systems under potentially changing data quality and distribution. By tracing individual input records, we can (1) identify outliers in a web crawling and document processing system and use the insights to define URL filtering rules; (2) identify heavy keys, such as NULL, that should be filtered before processing; (3) give hints to improve the key-based partitioning mechanisms; and (4) measure the limits of overpartitioning if heavy thread-unsafe libraries are imported. Using Apache Spark as illustration, we show how various data stream processing efficiency issues can be mitigated or optimized by our distributed tracing engine. We describe and qualitatively compare two different designs, one based on reporting to a distributed database and another based on trace piggybacking. Our prototype implementation consists of wrappers suitable for JVM environments in general, with minimal impact on the source code of the core system. Our tracing framework is the first to solve tracing in multiple systems across boundaries and to provide detailed performance measurements suitable for automated optimization, not just debugging. (C) 2018 Elsevier B.V. All rights reserved.
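A minimal sketch of idea (2) above, detecting heavy keys (such as NULL) from a sample of the stream and filtering them before key-based partitioning. The function name, sample size, and threshold are illustrative choices, not taken from the paper:

```python
from collections import Counter

def heavy_keys(records, key_fn, sample=10000, threshold=0.05):
    """Estimate heavy keys from a sample of the stream.

    A key is 'heavy' if it accounts for more than `threshold` of the
    sampled records; such keys (e.g. NULL) skew key-based partitioning.
    `sample` and `threshold` are illustrative parameters.
    """
    counts = Counter(key_fn(r) for r in records[:sample])
    total = sum(counts.values())
    return {k for k, c in counts.items() if c / total > threshold}

# Toy stream: half the records carry a NULL key.
stream = [{"url": None}] * 500 + [{"url": f"u{i}"} for i in range(500)]
hot = heavy_keys(stream, key_fn=lambda r: r["url"])
filtered = [r for r in stream if r["url"] not in hot]
```

In a Spark job the same filter would run before a `keyBy`/`groupByKey`, so no single partition receives the skewed key's traffic.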

Abstract:
In this paper, we investigate real-world scenarios in which the MapReduce programming model, and specifically the Hadoop framework, could be used for processing large-scale, geographically scattered datasets. We propose an Adaptive Reduce Task Scheduling (ARTS) algorithm and evaluate it both on a distributed Hadoop cluster spanning multiple datacenters and on a shared Hadoop cluster. The evaluation demonstrates that the ARTS algorithm outperforms the default Reduce-phase scheduling algorithm in the Hadoop framework.
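The abstract does not specify how ARTS places reduce tasks, but the general idea of locality-aware reduce scheduling for geo-distributed data can be sketched as follows; the heuristic here (place each reduce partition at the datacenter already holding most of its intermediate data) is purely illustrative:

```python
def place_reducers(map_output_sizes):
    """Assign each reduce partition to the datacenter that already holds
    the largest share of its map output, minimizing cross-datacenter
    shuffle traffic. `map_output_sizes[partition][dc]` is the number of
    bytes of map output for `partition` stored at datacenter `dc`.
    (Illustrative heuristic only; the actual ARTS algorithm is not
    described in the abstract.)
    """
    return {p: max(sizes, key=sizes.get) for p, sizes in map_output_sizes.items()}

sizes = {
    "p0": {"dc1": 900, "dc2": 100},
    "p1": {"dc1": 50,  "dc2": 800},
}
placement = place_reducers(sizes)
```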

Abstract:
The total dataset produced by the BaBar experiment at the Stanford Linear Accelerator Center (SLAC) currently comprises roughly $3\times 10^9$ data events and an equal amount of simulated events, corresponding to 23 Tbytes of real data and 51 Tbytes of simulated events. Since individual analyses typically select a very small fraction of all events, it would be extremely inefficient if each analysis had to process the full dataset. A first, centrally managed analysis step is therefore a common pre-selection ('skimming') of all data according to very loose, inclusive criteria to facilitate data access for later analysis. Usually, there are common selection criteria for several analyses. However, they may change over time, e.g., when new analyses are developed. Currently, $\mathcal{O}$…
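The skimming step described above amounts to keeping every event that passes at least one loose, inclusive criterion, so downstream analyses read only the skim. A minimal sketch, with criteria and field names that are invented for illustration (not BaBar's actual selection):

```python
def skim(events, criteria):
    """Centrally managed pre-selection ('skimming'): keep any event that
    passes at least ONE loose, inclusive criterion, so later analyses
    read only the skim instead of the full dataset.
    Criterion names and thresholds below are illustrative only.
    """
    return [e for e in events if any(c(e) for c in criteria)]

events = [{"n_tracks": 2, "e_tot": 1.0},
          {"n_tracks": 8, "e_tot": 0.2},
          {"n_tracks": 1, "e_tot": 0.1}]
criteria = [lambda e: e["n_tracks"] >= 4,   # hypothetical multiplicity cut
            lambda e: e["e_tot"] >= 0.5]    # hypothetical energy cut
selected = skim(events, criteria)
```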

Abstract:
The trend toward distributed processing has significantly increased the awareness of data as a key corporate resource and underscored the importance of its management. In spite of this, there is a lack of empirical investigation of issues related to data resource management (DRM) in distributed processing environments. Being perhaps the first empirical attempt, this exploratory study identifies four information systems (IS) variables related to DRM in a distributed environment. It also seeks to examine the notion of gestalt fit to describe the nature of the relationships among these variables.

Abstract:
The upgraded VHF digital interferometer (VHF DITF) system is introduced, which can continuously sample the radiation associated with lightning. A new processing technique was implemented which uses the distribution of slopes of the phase difference versus frequency to locate the radiation source. By using this technique, frequency components which are not due to lightning can be excluded, and both low- and high-amplitude sources are located. As a result, both positive breakdown and negative breakdown are located, and negative recoil leaders (recoil leaders) are visualized in great detail. The recoil leaders which continue into the positive charge region are seen to slow their propagation and dim their radiation as they cross the flash initiation region. Analysis of the relative received power of the different breakdown types, negative leaders, recoil leaders, and positive leaders, can also be made. In both intracloud and cloud-to-ground flashes, the modes of the distributions of received power for negative leaders, recoil leaders, and positive leaders were approximately the same. The brightest emissions seen from the positive leader were substantially lower than the brightest emission seen from the negative leader. The results also indicate that positive leaders and lower-elevation negative leaders emit more low-frequency radiation than recoil leaders and high-elevation negative leaders. By continuously sampling the VHF waveform, the upgraded VHF DITF locates many weak sources which the previous system was not capable of locating.
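The underlying relation exploited by the phase-slope technique is that, for a broadband source arriving with inter-antenna time delay $\tau$, the phase difference grows linearly with frequency, $\Delta\phi(f) = 2\pi f \tau$, so the slope of $\Delta\phi$ versus $f$ yields $\tau$. The following sketch shows only this basic relation on synthetic data; the actual DITF processing (using the *distribution* of slopes to reject non-lightning components) is more elaborate:

```python
import math

def delay_from_phase_slope(freqs_hz, phase_diffs_rad):
    """Estimate the inter-antenna time delay from the slope of phase
    difference versus frequency: dphi(f) = 2*pi*f*tau, so
    tau = slope / (2*pi). Ordinary least-squares slope fit.
    """
    n = len(freqs_hz)
    mf = sum(freqs_hz) / n
    mp = sum(phase_diffs_rad) / n
    num = sum((f - mf) * (p - mp) for f, p in zip(freqs_hz, phase_diffs_rad))
    den = sum((f - mf) ** 2 for f in freqs_hz)
    return num / den / (2 * math.pi)

# Synthetic example: a 5 ns delay observed across 40-80 MHz.
tau = 5e-9
freqs = [f * 1e6 for f in range(40, 81, 5)]
phases = [2 * math.pi * f * tau for f in freqs]
est = delay_from_phase_slope(freqs, phases)
```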

Abstract:
The CMS experiment is currently developing a computing system capable of serving, processing and archiving the large number of events that will be generated when the CMS detector starts taking data. During 2004 CMS undertook a large scale data challenge to demonstrate the ability of the CMS computing system to cope with a sustained data-taking rate equivalent to 25% of startup rate. Its goals were: to run CMS event reconstruction at CERN for a sustained period at 25 Hz input rate; to distribute the data to several regional centers; and to enable data access at those centers for analysis. Grid middleware was utilized to help complete all aspects of the challenge. To continue to provide scalable access from anywhere in the world to the data, CMS is developing a layer of software that uses Grid tools to gain access to data and resources, and that aims to provide physicists with a user-friendly interface for submitting their analysis jobs. This paper describes the data challenge experience with Grid infrastructure and the current development of the CMS analysis system.

Abstract:
Bandwidth-efficient execution of online big data analytics in telecommunication networks demands tailored solutions. Existing streaming analytics systems are designed to operate in large data centers, assuming unlimited bandwidth between data center nodes. Applying these solutions unmodified to distributed telecommunication clouds overlooks the fact that available bandwidth is a scarce and costly resource that makes the telecommunication network valuable to end-users. This article presents Continuous Hive (CHive), a streaming analytics platform tailored for distributed telecommunication clouds. The fundamental contribution of CHive is that it optimizes query plans to minimize their overall bandwidth consumption when deployed in a distributed telecommunication cloud. Additionally, these optimized query plans have a high degree of parallelism built in, benefiting speed of execution. Early experiments on data from a large mobile operator indicate that CHive can yield bandwidth reductions upwards of 99 percent.
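A standard query-plan rewrite behind this kind of bandwidth saving is to push partial aggregation to each edge site and ship only per-site partial results across the WAN, instead of raw records. The sketch below illustrates that rewrite for a simple count query; CHive's actual optimizer is not detailed in the abstract:

```python
from collections import Counter

def local_then_global_count(site_streams):
    """Aggregate locally at each (hypothetical) edge site, then ship only
    the per-site partial counts to the central site for the final merge.
    Returns the merged counts and the number of rows that crossed the WAN.
    """
    partials = [Counter(s) for s in site_streams]   # computed at each site
    shipped = sum(len(p) for p in partials)         # rows crossing the WAN
    total = sum(partials, Counter())                # merged at the central site
    return total, shipped

# Two sites, 2200 raw records in total; only 3 partial rows are shipped.
sites = [["a"] * 1000 + ["b"] * 500, ["a"] * 700]
total, shipped = local_then_global_count(sites)
```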

Abstract:
Given the sustained growth that we are experiencing in the number of SPARQL endpoints available, the need to be able to send federated SPARQL queries across these has also grown. To address this use case, the W3C SPARQL working group is defining a federation extension for SPARQL 1.1 which allows for combining graph patterns that can be evaluated over several endpoints within a single query. In this paper, we describe the syntax of that extension and formalize its semantics. Additionally, we describe how a query evaluation system can be implemented for that federation extension, describing some static optimization techniques and reusing a query engine used for data-intensive science, so as to deal with large amounts of intermediate and final results. Finally, we carry out a series of experiments that show that our optimizations speed up the federated query evaluation process.
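The core of federated evaluation is joining the solution mappings returned by different endpoints: two mappings are compatible if they agree on every shared variable. A naive nested-loop sketch of that join (real engines such as the one described here add static optimizations like pattern reordering); the endpoint data below is invented:

```python
def join_bindings(left, right):
    """Join solution mappings from two endpoints, SPARQL-style: two
    mappings are compatible if they agree on every variable they share,
    and compatible pairs are merged into one mapping.
    """
    out = []
    for l in left:
        for r in right:
            if all(l[v] == r[v] for v in l.keys() & r.keys()):
                out.append({**l, **r})
    return out

# Bindings as if returned by two remote SPARQL endpoints:
ep1 = [{"?drug": "d1", "?name": "aspirin"}, {"?drug": "d2", "?name": "x"}]
ep2 = [{"?drug": "d1", "?target": "COX-1"}]
res = join_bindings(ep1, ep2)
```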

Abstract:
With the explosive use of GPS-enabled devices, increasingly massive volumes of trajectory data capturing the movements of people and vehicles are becoming available, which is useful in many application areas, such as transportation, traffic management, and location-based services. As a result, many trajectory data management and analytic systems have emerged that target either offline or online settings. However, some applications call for both offline and online analyses. For example, in traffic management scenarios, offline analyses of historical trajectory data can be used for traffic planning purposes, while online analyses of streaming trajectories can be adopted for congestion monitoring purposes. Existing trajectory-based systems tend to perform offline and online trajectory analysis separately, which is inefficient. In this paper, we propose a hybrid and efficient framework, called Dragoon, based on Spark, to support both offline and online big trajectory management and analytics. The framework features a mutable resilient distributed dataset model, including RDD Share, RDD Update, and RDD Mirror, which enables hybrid storage of historical and streaming trajectories. It also contains a real-time partitioner capable of efficiently distributing trajectory data and supporting both offline and online analyses. Therefore, Dragoon provides a hybrid analysis pipeline. Support for several typical trajectory queries and mining tasks demonstrates the flexibility of Dragoon. 
An extensive experimental study using both real and synthetic trajectory datasets shows that Dragoon (1) offers offline trajectory query performance similar to that of the state-of-the-art system UlTraMan; (2) reduces storage overhead by up to a factor of two compared with UlTraMan during trajectory editing; (3) achieves at least 40% better scalability than popular stream processing frameworks (i.e., Flink and Spark Streaming); and (4) offers, on average, a twofold performance improvement for online trajectory data analytics.
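One simple way a partitioner can serve both offline and online trajectory analysis is to make partition assignment a pure function of a point's spatial cell, so historical batches and streaming points route identically. The grid scheme below is only an illustrative stand-in; Dragoon's real-time partitioner is more sophisticated than a static grid:

```python
def grid_partition(point, grid_size, num_partitions):
    """Map a trajectory point (lon, lat) to a partition via a uniform
    spatial grid, so nearby points land in the same partition whether
    they arrive in an offline batch or an online stream.
    (Illustrative only; not Dragoon's actual partitioner.)
    """
    lon, lat = point
    cell = (int(lon // grid_size), int(lat // grid_size))
    return hash(cell) % num_partitions

# Two nearby points fall into the same 1-degree cell, hence the same partition.
p1 = grid_partition((116.3, 39.9), grid_size=1.0, num_partitions=8)
p2 = grid_partition((116.7, 39.2), grid_size=1.0, num_partitions=8)
```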

Abstract:
With the rapid growth of emerging applications like social network, semantic web, sensor network and LBS (Location Based Service) applications, a variety of data to be processed continues to witness a quick increase. Effective management and processing of large-scale data poses an interesting but critical challenge. Recently, big data has attracted a lot of attention from academia, industry, and government. This paper introduces several big data processing techniques from system and application aspects. First, from the view of cloud data management and big data processing mechanisms, we present the key issues of big data processing, including the definition of big data, big data management platforms, big data service models, distributed file systems, data storage, data virtualization platforms and distributed applications. Following the MapReduce parallel processing framework, we introduce some MapReduce optimization strategies reported in the literature. Finally, we discuss the open issues and challenges and explore future research directions for big data processing in cloud computing environments.
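For reference, the MapReduce model discussed above can be captured in a few lines: map each record to key/value pairs, shuffle by key, then reduce each key's values. This sequential sketch only mirrors the programming model; real frameworks such as Hadoop execute each phase in parallel across a cluster:

```python
from collections import defaultdict

def mapreduce(records, mapper, reducer):
    """Minimal MapReduce skeleton: map each record to (key, value)
    pairs, group values by key (the shuffle), then reduce each group.
    """
    groups = defaultdict(list)
    for rec in records:                 # map phase
        for k, v in mapper(rec):
            groups[k].append(v)         # shuffle: group by key
    return {k: reducer(k, vs) for k, vs in groups.items()}  # reduce phase

# Classic word count expressed in this model:
docs = ["big data processing", "big data"]
counts = mapreduce(docs,
                   mapper=lambda line: [(w, 1) for w in line.split()],
                   reducer=lambda w, vs: sum(vs))
```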