Archival Data

Key technical concepts

John Sammons, in The Basics of Digital Forensics (Second Edition), 2015

Archival data

Archival data, or backups, can take many forms. External hard drives, DVDs, and backup tapes are but a few examples. Acquisition of archival data can range from simple to extremely complex. The type and age of the backup media are major factors in determining the complexity of the process.

Backup tapes can present some very big challenges, especially if they were made with software or hardware that is no longer in production. Tapes are created using specific pieces of hardware and software. These same tools will be needed to restore the data into a form that can be understood and manipulated. Where it gets really tricky is when the hardware and software are no longer in production. An older version of the software may no longer be available, or the company may no longer be in business. This is known as legacy data. What do you do if you no longer have, and can't get access to, the necessary tools to restore the data? Sometimes eBay can save the day.


URL:

https://www.sciencedirect.com/science/article/pii/B9780128016350000024

Secondary Data

Marc Riedel, in Encyclopedia of Social Measurement, 2005

Documentation

How adequate is the documentation? Archival data typically provide more adequate documentation than official statistics taken directly from agencies. The most problematic are official records in which abbreviations or shorthand is used in the records. In the latter instance, it is necessary to discuss with records managers what each of the symbols means.

A major documentation problem is the indication of missing data. Conventionally, a missing value is a statement of ignorance: we do not know what the value is. Frequently, blanks are used to indicate missing data, but the meaning of the blank is unclear. For example, if information is missing about an offender's previous criminal history, does that mean he or she had no previous criminal history, or does it mean the information was lost or never collected?
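One practical way to avoid this ambiguity is to code missing values explicitly rather than leaving blanks. The following is a minimal Python sketch of such a coding step; the field names and numeric codes are hypothetical.

```python
# A minimal sketch (not from the chapter) of making missing-data codes explicit
# instead of leaving ambiguous blanks. The field names and codes are hypothetical.

NO_PRIOR_HISTORY = 0      # affirmatively recorded: no previous criminal history
UNKNOWN = -9              # value was lost or never collected

def code_prior_history(raw_value: str) -> int:
    """Translate a raw archival field into an unambiguous numeric code."""
    cleaned = raw_value.strip().lower()
    if cleaned in {"", "na", "n/a"}:
        # A blank is coded as UNKNOWN, never silently treated as "no history".
        return UNKNOWN
    if cleaned in {"none", "no priors"}:
        return NO_PRIOR_HISTORY
    return int(cleaned)   # otherwise, the recorded count of prior offenses

records = ["3", "", "none", "N/A"]
print([code_prior_history(r) for r in records])  # [3, -9, 0, -9]
```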


URL:

https://www.sciencedirect.com/science/article/pii/B0123693985000712

Criminology

Chester L. Britt, in Encyclopedia of Social Measurement, 2005

Courts

One of the classic studies of court operations is provided in Eisenstein and Jacob's 1977 analysis of courts in Baltimore, Chicago, and Detroit. In addition to using official archival data on cases processed in each of these three jurisdictions, they spent many hours in courtrooms taking notes on the activities and interactions occurring in the court. They followed their detailed observations by interviewing judges, prosecutors, and defense attorneys to have each explain what was taking place in the court and why. Thus, where prior research and Eisenstein and Jacob's own analysis of archival data had been able to document statistical regularities in court decision-making, such as the effect of pleading guilty on the severity of punishment, the use of systematic observation and in-depth interviewing was able to provide an explanation for why this link existed.


URL:

https://www.sciencedirect.com/science/article/pii/B0123693985002723

Cloud-Based Smart-Facilities Management

S. Majumdar, in Internet of Things, 2016

17.6 Resource management techniques for supporting data analytics

On a smart facility, analyses of sensor data, as well as archived maintenance data, are both important for its effective management. Batch-data-analytics techniques may be applied to archived data, for instance, to determine the next maintenance cycle, whereas real-time data analytics, which concerns the processing of sensor data in real time, may be important for performing real-time control of the facility or for handling emergencies. MapReduce is a well-known technique [33] that is used for performing data analytics on the large volumes of data that are typical of smart facilities. The basic idea behind MapReduce is briefly explained below.

The input data is divided into chunks, each of which is handled by a separate map task. Multiple map tasks, each handling a specific chunk of input data, are executed concurrently on a parallel system, such as a cluster or a cloud. The outputs of the different map tasks are then combined with the help of several reduce tasks that run concurrently on the system. Although the same MapReduce architecture is used, the application logic for the map and reduce tasks can vary from one facility to another. Effective allocation of processors to tasks and task scheduling are crucial for achieving high system performance. Resource management techniques for task resource allocation and scheduling in MapReduce systems that process jobs on a best-effort basis have been thoroughly studied. Associating a Service Level Agreement (SLA) that includes a deadline with MapReduce jobs has recently started receiving attention [20,21,34]. The ability to associate a deadline with a job is important for performing real-time data analytics, including the real-time processing of event logs collected on the facility. Resource management is known to be a computationally hard problem. The association of a deadline, and the availability of multiple resources in a cloud used, for example, for the deployment of the MapReduce framework, which is characterized by multiple phases of operation, further complicate the problem.

Innovative algorithms for resource allocation and scheduling for handling a batch of MapReduce jobs with deadlines are described in [35]. The authors propose two different approaches, based on optimization techniques, for resource management: Mixed Integer Linear Programming (MILP) and Constraint Programming (CP). The MILP-based resource management algorithm is implemented using LINGO [36], whereas IBM ILOG CPLEX [37] is used in implementing the CP-based algorithm. The results of a simulation-based performance evaluation presented in [35] demonstrate the superiority of the CP-based technique (Fig. 17.5). The figure displays the results for two of the largest of the five workloads used in the research. Large 1 corresponds to a batch of 2 jobs, with each job having 100 map tasks and 30 reduce tasks, whereas Large 2 corresponds to a batch of 50 jobs, with each job having a number of map tasks ranging from 1 to 100, and a number of reduce tasks ranging from 1 to the number of map tasks in the corresponding job. Further details of the workload and system parameters are provided in [35]. The completion time for the batch, as well as the processing time for the resource management algorithm (the system overhead incurred), are much lower for CP (Approach 3 in Fig. 17.5) in comparison to MILP (Approach 1 in Fig. 17.5). Note that, of the two approaches, only the CP-based technique could handle the Large 2 workload. Following the success of the CP-based approach in the case of batch processing, the authors devised a CP-based technique for handling MapReduce jobs with SLAs on clouds subjected to an open stream of arriving MapReduce jobs [29]. The high performance of their CP-based algorithm is reflected in the low number of jobs with missed deadlines reported in a simulation-based investigation. Validation of the effectiveness of their algorithm on real Hadoop clusters has also been performed.
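The following is a minimal Python sketch of the map/shuffle/reduce pattern just described, applied to hypothetical archived sensor readings. It illustrates the general idea only; it is not the deadline-aware MILP/CP scheduler evaluated in [35].

```python
# A minimal sketch of the MapReduce pattern: chunks are mapped concurrently,
# intermediate pairs are grouped by key, and reduce tasks aggregate each group.
from multiprocessing import Pool
from collections import defaultdict

def map_task(chunk):
    """Map: emit (sensor_id, reading) pairs for one chunk of the archive."""
    return [(sensor_id, value) for sensor_id, value in chunk]

def reduce_task(item):
    """Reduce: compute the mean reading per sensor."""
    sensor_id, values = item
    return sensor_id, sum(values) / len(values)

if __name__ == "__main__":
    # Hypothetical archived data, pre-split into chunks (one per map task).
    chunks = [
        [("s1", 20.1), ("s2", 30.5)],
        [("s1", 21.3), ("s2", 29.9)],
        [("s1", 19.8), ("s3", 45.0)],
    ]
    with Pool() as pool:
        # Map phase: chunks are processed concurrently.
        mapped = pool.map(map_task, chunks)
        # Shuffle: group intermediate pairs by key.
        grouped = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                grouped[key].append(value)
        # Reduce phase: one reduce task per key, also run concurrently.
        print(dict(pool.map(reduce_task, grouped.items())))
```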

Figure 17.5. Performance of Different Resource Management Approaches for a System Running MapReduce Jobs (from Ref. [35])

17.6.1 Streaming data analytics

Batch, real-time, and streaming data analytics are all important in the context of analyzing data collected on smart facilities. As discussed before, batch analytics is performed on stored archival data, whereas real-time analytics is needed when an event (e.g., a storm) occurs, requiring the analysis of the effect of the event on the smart facility in real time. MapReduce and MapReduce with deadlines can be used in these two situations, respectively. Streaming data analytics is required when streams of sensor data need to be analyzed continuously, for example, for determining the health of the system. Parallel-processing frameworks such as Storm [38] have been developed for performing streaming data analytics. Resource management for achieving effective streaming data analytics has recently started receiving attention from researchers. Existing work includes using parallel processing to provide Quality of Service (QoS) guarantees for stream processing, described in [39]. A reactive scaling strategy to enforce latency constraints on the computation performed on a Stream Processing Engine is presented in [40]. No permanent static provisioning is assumed, and the technique can effectively handle varying workloads. Resource management for systems supporting streaming analytics is an important problem and needs further investigation.
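As an illustration of the kind of continuous computation involved, the following is a minimal Python sketch of a sliding-window health check over a sensor stream. It is not tied to Storm or any particular stream-processing engine, and the window size and threshold are arbitrary assumptions.

```python
# A minimal sketch of streaming analytics: a sliding window over a sensor
# stream that flags readings deviating from the recent average.
from collections import deque

def monitor(stream, window_size=5, threshold=3.0):
    window = deque(maxlen=window_size)
    for reading in stream:
        if len(window) == window.maxlen:
            mean = sum(window) / len(window)
            if abs(reading - mean) > threshold:
                yield ("ALERT", reading, mean)
        window.append(reading)

# Hypothetical sensor readings; in practice this would be an unbounded source.
readings = [20.0, 20.2, 19.9, 20.1, 20.0, 27.5, 20.3]
for event in monitor(readings):
    print(event)
```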


URL:

https://www.sciencedirect.com/science/article/pii/B9780128053959000174

Investigating software modularity using class and module level metrics

Michael English, ... J.J. Collins, in Software Quality Assurance, 2016

8.10.2 Data Collection and Analysis

The target of our analysis is the modules in the Open Source Software system Weka. These are the classes, packages, and directories in the system, and we use third-degree, or archival, data (Benbasat et al., 1987; Lethbridge et al., 2005) in the form of software metrics to address our research questions.

In terms of characterizing modularity with respect to HLMs, we select directories as candidate HLMs, but any arbitrary grouping of Java classes and interfaces could be employed. For these directories, we evaluate their candidacy as HLMs using the Cluster Factor and Distance metrics as defined previously.

We look at the top-ranked directories initially in terms of Cluster Factor. The Distance metric, which reports on the degree to which a module is well formed with respect to Abstractness and Instability, is employed to further triangulate the results towards candidate HLM identification (Stake and Savolainen, 1995), as per good case study practice (Runeson and Höst, 2009).
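For readers unfamiliar with the Distance metric, the following is a minimal sketch assuming the standard definitions usually attributed to Martin (Instability I = Ce/(Ca + Ce), Abstractness A = abstract classes/total classes, Distance D = |A + I - 1|); the chapter's exact formulation, and the Cluster Factor itself, may differ. It only illustrates how a candidate HLM could be scored from archival metric data.

```python
# A minimal sketch of the Distance metric under the standard definitions;
# the figures used here are hypothetical directory-level values.

def instability(ca: int, ce: int) -> float:
    """Ca: afferent (incoming) couplings; Ce: efferent (outgoing) couplings."""
    return ce / (ca + ce) if (ca + ce) else 0.0

def abstractness(abstract_classes: int, total_classes: int) -> float:
    return abstract_classes / total_classes if total_classes else 0.0

def distance(a: float, i: float) -> float:
    """Normalized distance from the 'main sequence' A + I = 1."""
    return abs(a + i - 1.0)

a = abstractness(abstract_classes=4, total_classes=20)
i = instability(ca=15, ce=5)
print(f"A={a:.2f}, I={i:.2f}, D={distance(a, i):.2f}")
```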

In addition, we traverse the directory structure to look for directories where the Cluster Factor increases substantially in moving from one level to another, suggesting that the directory sub-tree rooted at the node with the higher Cluster Factor is a good candidate as an HLM.

In terms of assessing the ability of high-level metric information to inform on lower-level constructs, we use coupling metrics to see if those that are classified as candidate HLMs inform on good candidate lower-level modules, and if those that are contraindicated as candidate HLMs inform on lower-level modules with respect to coupling.

The source code metrics were extracted using Understand (Scitools, 2015) and a number of spreadsheet calculations, which sat on top of the analyses provided by Understand.


URL:

https://www.sciencedirect.com/science/article/pii/B9780128023013000089

Big Data Analytics on a Smart Grid

S. Wallace, ... K.-T. Lu, in Big Data, 2016

17.3 Improving Traditional Workflow

A traditional workflow involving the archival PMU/PDC data may unfold along the following lines:

1.

A researcher is notified of an interesting phenomenon at some point in recent history.

2.

The researcher obtains archival data for a span of several minutes based on the approximate time of the phenomena.

3.

The researcher plots or otherwise analyzes the signal streams over the retrieved timespan to isolate the time and signal locations at which the phenomena are most readily observed.

4.

The researcher finally creates plots and performs an analysis over a very specific time range and narrow set of PMU signals as appropriate for the task at hand.

While this workflow is entirely reasonable, it likely leaves the bulk of the data unanalyzed. We view this as a good opportunity to leverage the methods of Big Data analytics and machine learning to turn the latent information in this archived data into useful knowledge.

Specifically, we see the following opportunities:

characterizing normal operation,

identifying unusual phenomena, and

identifying known events.

These operations can all be performed on the historic archive of PMU data using Hadoop/Spark. Typically this would be done at the data center housing that content. Once normal operation has been appropriately characterized, and identification has been performed on the historic data, identification of unusual phenomena and known events can proceed on the incoming (live) PMU data stream in real time with minimal hardware requirements.
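As a concrete illustration of the first two opportunities, the following is a minimal PySpark sketch that characterizes normal operation as a per-signal mean and standard deviation over the archived data and then flags readings far outside that baseline. The column names, storage path, and threshold are assumptions, not part of the original workflow.

```python
# A minimal PySpark sketch: build a per-signal baseline from archived PMU data,
# then flag readings that deviate strongly from it.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pmu-baseline").getOrCreate()

# Archived PMU measurements; assumed columns: signal_id, timestamp, value.
archive = spark.read.parquet("hdfs:///pmu/archive")  # path is an assumption

# Characterize normal operation: per-signal mean and standard deviation.
baseline = archive.groupBy("signal_id").agg(
    F.mean("value").alias("mu"),
    F.stddev("value").alias("sigma"),
)

# Identify unusual phenomena: readings far outside the per-signal baseline.
unusual = (
    archive.join(baseline, "signal_id")
    .where(F.abs(F.col("value") - F.col("mu")) > 4 * F.col("sigma"))
)
unusual.select("signal_id", "timestamp", "value").show()
```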


URL:

https://www.sciencedirect.com/science/article/pii/B9780128053942000179

Climate, History of

C. Pfister, in International Encyclopedia of the Social & Behavioral Sciences, 2001

4.7 ENSO

The El Niño-Southern Oscillation (ENSO) is the outcome of a cyclic warming and cooling of the ocean surface in the central and eastern Pacific that strongly affects rainfall in the areas around the Pacific and the Indian Ocean. Archival data suggest that ENSO episodes from 1600 to 1900 had a more intense and global effect than those of the twentieth century. For example, the worst droughts in the colonial history of India (mid 1590s, 1629 to 1633, 1685 to 1688, 1788 to 1793, 1877 to 1878) are related to ENSO-connected failures of the monsoon. For the last two events the global dimension of these episodes has been demonstrated (Grove and Chappell 2000).


URL:

https://www.sciencedirect.com/science/article/pii/B0080430767026644

Pentest Project Management

Thomas Wilhelm, in Professional Penetration Testing (Second Edition), 2013

Archival Locations

If we plan on archiving data, we need to think about disaster recovery and business continuity planning, which can become quite complicated as risks are identified in the archiving process. Let's say that we want to archive data; storing archival data in the same room or building as the system that used to retain the data is usually a bad idea. We determine that the archived penetration test data need to be stored in a secure facility that is geographically disparate from the location of the system being archived, due to the ever-present threat of natural and human-made disasters. Another consideration is that we need two copies: one relocated elsewhere and the other kept locally, in case we need quick access.

Are You Owned?

Data Archive Nightmare

I once had a conversation with a network administrator of a software development shop about his archival process for the corporate software development repository server. He had been archiving data for years and felt their data was safe. The data had never been verified for integrity, but because the tape archival system kept indicating that the backups were successful, everything was assumed to be fine. We ran a test and found out that most of the tapes were blank. It turns out that the system administrator had turned off the archival client on the code repository system because "it slowed the system down"; the network administrator was not alerted to this problem because the backup system's default response to a nonresponsive client was to pass over the nonresponsive client and move on to the next system. At the end of the archival process, the archival system would create a note in its log that some systems (including the code repository system) had not been archived, but that the overall backup was "successful." Because the network administrator never looked into the details of the report and only paid attention to the success notice, they assumed everything worked.
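The lesson is to verify backups by content rather than by trusting the backup software's status report. The following is a minimal sketch of such a check, comparing checksums of a source tree against a restored copy; the paths are hypothetical.

```python
# A minimal sketch (not from the chapter) of backup verification by content:
# compare SHA-256 checksums of the source tree against a restored copy.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()

def verify(source_root: Path, restored_root: Path) -> list[str]:
    """Return relative paths that are missing or differ in the restored copy."""
    problems = []
    for source_file in source_root.rglob("*"):
        if not source_file.is_file():
            continue
        restored_file = restored_root / source_file.relative_to(source_root)
        if not restored_file.is_file() or sha256_of(source_file) != sha256_of(restored_file):
            problems.append(str(source_file.relative_to(source_root)))
    return problems

# Hypothetical paths for the code repository and its restored backup.
print(verify(Path("/srv/code-repo"), Path("/mnt/restored/code-repo")))
```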

Once we decide to relocate the data, we realize that even though relocating archival data to an off-site location reduces one risk (loss of data through local disaster), it introduces another risk (unauthorized access) because the data is transported and stored elsewhere. If the data are encrypted before transit, we can mitigate the new risk, but now we need to have a way of decrypting the data remotely, in case we lose all our systems locally. If we archived data using a tape backup archival system, such as VERITAS, we need to acquire a second system for the second set of archival data at our alternate location. Naturally, we need to ship the encryption key, so we can decrypt the data later if needed; we can't send the key during transit of the data, in case the data get stolen along the way.
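As a rough illustration, the following sketch encrypts an archive before it leaves the building, assuming the third-party Python cryptography package; the file names are hypothetical, and the key would be shipped by a separate channel, as discussed above.

```python
# A minimal sketch of encrypting archival data before off-site transport.
# The key must travel separately from the ciphertext.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # store/transport this separately from the data
fernet = Fernet(key)

with open("pentest-archive.tar", "rb") as plaintext:
    ciphertext = fernet.encrypt(plaintext.read())

with open("pentest-archive.tar.enc", "wb") as outfile:
    outfile.write(ciphertext)        # this file is what travels off-site

# At the alternate site, with the separately shipped key:
restored = Fernet(key).decrypt(ciphertext)
```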

Now that we have data located in two locations, how do we access the second set of data? We need remote staff to perform the process, which means we need to train them on how to decrypt the data and secure it properly. Once the data are decrypted, is there a secure facility to store them, and what kind of physical security exists? Now we have to think about guns, gates, and guards, which also mean background checks, physical penetration tests, and so on.

As we can see, archiving data is not a simple process; there are many factors to consider. We must have a process that keeps our client's data secure, no matter where it is stored.


URL:

https://www.sciencedirect.com/science/article/pii/B9781597499934000057

Coding Variables

Lee Epstein, Andrew Martin, in Encyclopedia of Social Measurement, 2005

Introduction

Social scientists engaged in empirical research, that is, research seeking to make claims or inferences based on observations of the real world, undertake an enormous range of activities. Some investigators collect information from primary sources; others rely primarily on secondary archival data. Many do little more than categorize the information they collect; but many more deploy complex technologies to analyze their data.

Seen in this way, it might appear that, beyond following some basic rules of inference and guidelines for the conduct of their research, scholars producing empirical work have little in common. Their data come from a multitude of sources; their tools for making use of the data are equally varied. But there exists at least one task in empirical scholarship that is universal, one that virtually all scholars and their students perform every time they undertake a new project: coding variables, or the process of translating properties or attributes of the world (i.e., variables) into a form that researchers can systematically analyze after they have chosen the appropriate measures to tap the underlying variable of interest. Regardless of whether the data are qualitative or quantitative, and regardless of the form the analyses take, nearly all researchers seeking to make claims or inferences based on observations of the real world engage in the process of coding data. That is, after measurement has taken place, they (1) develop a precise schema to account for the values each variable of interest can take on and then (2) methodically and physically assign each unit under study a value for every given variable.
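As a simple illustration of these two phases (our own, not the authors'), the following sketch defines a precise schema of permissible codes and then assigns values to a unit under study, rejecting anything outside the schema; the variables and codes are hypothetical.

```python
# A minimal sketch of coding variables: (1) a schema of legal values per
# variable, (2) methodical assignment of a value to each unit under study.

SCHEMA = {
    "disposition": {1: "guilty plea", 2: "bench trial", 3: "jury trial"},
    "prior_record": {0: "none", 1: "misdemeanor only", 2: "felony", 9: "unknown"},
}

def assign(unit: dict, variable: str, value: int) -> None:
    """Phase 2: record a value only if the schema defines it."""
    if value not in SCHEMA[variable]:
        raise ValueError(f"{value!r} is not a legal code for {variable!r}")
    unit[variable] = value

case = {"case_id": "1977-0042"}
assign(case, "disposition", 1)
assign(case, "prior_record", 9)
print(case)
```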

And yet, despite the universality of the task (not to mention the central role it plays in research), it typically receives only the briefest mention in most volumes on designing research or analyzing data. Why this is the case is a question on which we can only speculate, but an obvious response centers on the seemingly idiosyncratic nature of the undertaking. For some projects, researchers may be best off coding inductively, that is, collecting their data, drawing a representative sample, examining the data in the sample, and then developing their coding scheme; for others, investigators proceed in a deductive fashion, that is, they develop their schemes first and then collect/code their data; and for yet a third set, a combination of inductive and deductive coding may be most appropriate. (Some writers associate inductive coding with research that primarily relies on qualitative [nonnumerical] data/research and deductive coding with quantitative [numerical] research. Given the [typically] dynamic nature of the processes of collecting data and coding, however, these associations do not always or perhaps even commonly hold. Indeed, it is probably the case that most researchers, regardless of whether their data are qualitative or quantitative, invoke some combination of deductive and inductive coding.) The relative ease (or difficulty) of the coding task also can vary, depending on the types of data with which the researcher is working, the level of detail for which the coding scheme calls, and the amount of pretesting the analyst has conducted, to name just three.

Nevertheless, we believe it is possible to develop some generalizations about the process of coding variables, as well as guidelines for doing so. This much we attempt to accomplish here. Our discussion is divided into two sections, corresponding to the two key phases of the coding process: (1) developing a precise schema to account for the values of the variables and (2) methodically assigning each unit under study a value for every given variable. Readers should be aware, however, that although we made as much use as we could of existing literatures, discussions of coding variables are sufficiently few and far between (and where they do exist, rather scanty) that many of the generalizations we make and the guidelines we offer come largely from our own experience. Accordingly, sins of commission and omission probably loom large in our discussion (with the latter particularly likely in light of space limitations).


URL:

https://www.sciencedirect.com/science/article/pii/B0123693985003194

XML and Data Engineering

Peter Aiken, David Allen, in XML in Data Management, 2004

XML and Metadata Modeling

When you have prepared the data quality engineering aspects of your metadata engineering analysis, you can begin the process of extracting metadata. Use the Common Metadata Model (CM2) described below as the basis for organizing your data within each framework cell. If you do not implement the right or perfect format for the data structure, you can have XML transformations developed to correct and improve your existing metadata. The CM2 forms the basis for developing the XML-based data structures according to three specific model decompositions. In this section, we present an overview of the metamodel framework (see Figure 4.8). More on the common metadata model, including business rule translations, can be found in Chapter 9 of Finkelstein and Aiken (1998).

Figure 4.8. Three primary and seven second-level model view decompositions used to organize the system metadata. This comprises the required reverse engineering analysis and conceptual structure decompositions.

The same figure also illustrates the required metadata structure to support basic repository access requirements. There are two key success factors for each XML-based metadata reengineering analysis. First, the analysis chunks should be sized in a way that makes results easy to show. Second, the analysts should be rather finicky about which framework and common metadata model components get the most focus. Smaller, more focused projects make it easier to demonstrate results.

The set of optional, contextual extensions can also be managed in a separate decomposition. Some of these metadata items may be easily gathered during the process and should be gathered if they are relevant to the analysis. Contextual extensions might include the following (a sketch of how such extensions could be recorded appears after this list):

Data usage type—operational or decision support

Data residency—functional, subject, geographic, etc.

Data archival type—continuous, event-discrete, periodic-discrete

Data granularity type—defining the smallest unit of addressable data as an attribute, an entity, or another unit of measure

Data access frequency—measured by accesses per measurement period

Data access probability—probability that an individual logical data attribute will be accessed during a processing period

Data update probability—probability that a logical data attribute will be updated during a processing period

Data integration requirements—the number and possible classes of integration points

Data subject types—the number and possible subject area breakdowns

Data location types—the number and availability of possible node locations

Data stewardship—the business unit charged with maintaining the logical data attribute

Data attribute system component of record—the system component responsible for maintaining the logical data attribute
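As a rough illustration, the following sketch records a few of these contextual extensions as XML for a single logical data attribute; the element and attribute names are our own, not part of the CM2 specification.

```python
# A minimal sketch of capturing contextual-extension metadata as XML.
import xml.etree.ElementTree as ET

attribute = ET.Element("dataAttribute", name="customer_birth_date")
extensions = ET.SubElement(attribute, "contextualExtensions")
for name, value in {
    "usageType": "decision support",
    "residency": "subject",
    "archivalType": "periodic-discrete",
    "accessFrequency": "1200 per month",
    "stewardship": "Customer Data Management",
}.items():
    ET.SubElement(extensions, "extension", type=name).text = value

print(ET.tostring(attribute, encoding="unicode"))
```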


URL:

https://www.sciencedirect.com/science/article/pii/B9780120455997500042