2

# Data Analytics

Print publication date:  December  2020
Online publication date:  December  2020

Print ISBN: 9780367537418
eBook ISBN: 9781003083146

10.1201/9781003083146-3

#### Abstract

This chapter presents the concepts and methods for data visualization. A project control scenario is used to illustrate data management for measurement of project performance. The data presentation techniques presented in this chapter are translatable to other data analytics platforms.

#### Data Visualization Methods

Data viewed is data appreciated.

#### Introduction to Data Visualization

Statistical data management is essential for measurement, particularly for analyzing and interpreting measurement outputs. In this chapter, a project control scenario is used to illustrate data management for measurement of project performance. The data presentation techniques presented in this chapter are translatable to other data analytics platforms. The present age of computer software, hardware, and tools offers a vast array of techniques for data visualization, beyond what is presented in this chapter. Readers are encouraged to refer to the latest commercial and open-source software for data visualization. More importantly, the prevalence of cloud-based subscription software products can assist with on-demand data visualization needs. Those online tools should be leveraged at the time of need. The chapter presents only basic and standard methods to spark and guide the interest and awareness of readers.

#### Case Example of “Covidvisualizer” Website

For challenges of interest, such as the COVID-19 pandemic, data visualization can generate an immediate impact of understanding and appreciation, and, consequently, the determination of the lines of action needed. Tracking the fast worldwide spread of coronavirus helped to heighten the necessity and utility of data visualization. In the wake of COVID-19, several online data visualization tools evolved quickly to inform and educate the public about the disease’s spread. One of the earliest such tools was the www.covidvisualizer.com website, which was developed in 2020 by Navid Mamoon and Gabriel Rasskin, two undergraduate students at Carnegie Mellon University. The goal of the project is to provide a simple interactive way to visualize the impact of COVID-19. The developers want people to see the effort as something that brings everyone together in the collective worldwide fight against COVID-19. The website has a colorful and visually pleasing (almost trance-inducing) rotation of the Earth. Clicking on a country as it rotates by brings up the country’s up-to-the-minute statistics for COVID-19. The information displayed includes the following:

• Country name
• Country flag
• Total cases
• Active cases
• Deceased
• Recovered cases
• Line chart (trend line) over time for active, deaths, and recovered

In response to the developers’ solicitation of questions, suggestions, or feedback, I had the pleasure of contacting them to suggest adding a search tool to the website. In the original design, a country’s information could be accessed only by clicking the country as it passed during the globe’s rotation, and the countries were not labeled with their names. This means that a user has to know which country is which on the world map in order to click on it. Unfortunately, not all users can identify specific countries on the world map. Further, some countries are so tiny that clicking on them on a rotating globe is practically impossible. The idea of a search tool is to improve the user-friendliness of the website by providing a way to search for a specific country of interest. The developers were excited about the feedback and implemented a by-name search tool. Their confirmation, dated March 28, 2020, is reproduced below:

Thank you for requesting the search feature on our website, covidvisualizer.com . We apologize for the delay, (it can take a while to develop a feature like this) but there is now a search function running on the site! You can search by country name or ISO code by simply clicking the new little search icon.

Unfortunately, within 24 hours, the search tool was removed, so I reengaged with the developers. Their response of March 29, 2020, is quoted below:

We unfortunately disabled it, it caused some issues with our server and we'll have to develop it further.

Apparently, adding a search tool caused the website computer server to crash. The developers responded to the suggestion and they developed the visualization tool further. In a subsequent version of the website, the developers included two stable and sustainable search tools, through which a user can search by country name or by scrolling through the alphabetical listing of all countries. The website has enjoyed consistent worldwide usage since it was introduced in early March 2020. I am delighted and proud that, from a user perspective, I was able to provide mentoring and technical feedback to the website developers. The lesson of this experience is that we are all in the fight against COVID-19 together and teamwork is essential for success. In addition, user assessment and feedback are essential for product advancement regardless of whether the product is a commercial product or an open-source tool available free online. Thus, making a contribution to the utility of this very useful website is a proud accomplishment that bears out the theme of this chapter and the entire book.

#### Dynamism and Volatility of Data

The data to be visually presented can be dynamic, volatile, and elusive. The more we can know about the characteristics of the data, the better we can design, evaluate, and implement the technical protocol to handle the data. Transient data is defined as a volatile set of data that is used for one-time decision-making and is not then needed again. An example may be the number of operators that show up at a job site on a given day. Unless there is some correlation between the day-to-day attendance records of operators, this piece of information will have relevance only for that given day. The project manager can make his decision for that day on the basis of that day’s attendance record. Transient data need not be stored in a permanent database unless it may be needed for future analysis or uses (e.g., forecasting, incentive programs, performance review).

Recurring data refers to data that is encountered frequently enough to necessitate storage on a permanent basis. An example is a file containing contract due dates. This file will need to be kept at least through the project life cycle. Recurring data may be further categorized into static data and dynamic data. Recurring data that is static retains its original parameters and values each time it is retrieved and used. Recurring data that is dynamic has the potential for taking on different parameters and values each time it is retrieved and used. Storage and retrieval considerations for project control should address the following questions:

1. What is the origin of the data?
2. How long will the data be maintained?
4. What will the data be used for?
5. How often will the data be needed?
6. Is the data for look-up purposes only (i.e., no printouts)?
7. Is the data for reporting purposes (i.e., generate reports)?
8. In what format is the data needed?
9. How fast will the data need to be retrieved?
10. What security measures are needed for the data?

#### Data Determination and Collection

It is essential to determine what data to collect for project control purposes. Data collection and analysis are the basic components of generating information for project control. The requirements for data collection are discussed next.

#### Choosing the Data

This involves selecting data on the basis of their relevance and the level of likelihood that they will be needed for future decisions and whether or not they contribute to making the decision better. The intended users of the data should also be identified.

#### Collecting the Data

This identifies a suitable method of collecting the data as well as the source from which the data will be collected. The collection method will depend on the particular operation being addressed. The common methods include manual tabulation, direct keyboard entry, optical character reader, magnetic coding, electronic scanner, and, more recently, voice command. An input control may be used to confirm the accuracy of collected data. Examples of items to control when collecting data are the following:

#### Relevance Check

This checks if the data is relevant to the prevailing problem. For example, data collected on personnel productivity may not be relevant for a decision involving marketing strategies.

#### Limit Check

This checks to ensure that the data is within known or acceptable limits. For example, an employee overtime claim amounting to over 80 hours per week for several weeks in a row is an indication of a record well beyond ordinary limits.

#### Critical Value

This identifies a boundary point for data values. Values below or above a critical value fall in different data categories. For example, the lower specification limit for a given characteristic of a product is a critical value that determines whether or not the product meets quality requirements.
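The limit check and the critical-value check described above can be sketched in a few lines of Python. This is only an illustration: the function names, the 80-hour overtime ceiling taken from the example, and the specification limit of 9.5 are assumptions, not values prescribed by the chapter.

```python
# Hypothetical thresholds, chosen only for illustration.
MAX_WEEKLY_OVERTIME_HOURS = 80.0   # upper limit from the overtime example
LOWER_SPEC_LIMIT = 9.5             # assumed lower specification limit


def within_limits(value, low, high):
    """Limit check: True when value lies in the closed interval [low, high]."""
    return low <= value <= high


def classify_by_critical_value(value, critical_value):
    """Critical-value check: values at or above the boundary are accepted."""
    return "accept" if value >= critical_value else "reject"


# Weekly overtime claims (hours); anything outside 0..80 is flagged for review.
overtime_claims = [12.0, 85.0, 40.0]
flagged = [h for h in overtime_claims
           if not within_limits(h, 0.0, MAX_WEEKLY_OVERTIME_HOURS)]

print("Claims flagged by the limit check:", flagged)
print("Measurement 9.7:", classify_by_critical_value(9.7, LOWER_SPEC_LIMIT))
print("Measurement 9.4:", classify_by_critical_value(9.4, LOWER_SPEC_LIMIT))
```

A relevance check is harder to automate because it depends on the decision at hand; in practice it might be approximated by tagging each data item with the decisions it supports.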

#### Coding the Data

This refers to the technique used in representing data in a form useful for generating information. This should be done in a compact and yet meaningful format. The performance of information systems can be greatly improved if effective data formats and coding are designed into the system right from the beginning.

#### Processing the Data

Data processing is the manipulation of data to generate useful information. Different types of information may be generated from a given data set depending on how it is processed. The processing method should consider how the information will be used, who will be using it, and what caliber of system response time is desired. If possible, processing controls should be used.

#### Control Total

It checks the completeness of the processing by comparing accumulated results to a known total. An example of this is the comparison of machine throughput to a standard production level or the comparison of cumulative project budget depletion to a cost accounting standard.

#### Consistency Check

It checks if the processing is producing the same results for similar data. For example, an electronic inspection device that suddenly shows a measurement that is ten times higher than the norm warrants an investigation of both the input and the processing mechanisms.
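The two processing controls above might be sketched as follows. The function names, the tolerance parameter, and the sample readings are hypothetical; the factor of ten mirrors the inspection-device example in the text.

```python
def control_total_ok(processed_values, known_total, tolerance=0.0):
    """Control total: the accumulated result must match a known total."""
    return abs(sum(processed_values) - known_total) <= tolerance


def inconsistent_readings(readings, norm, max_ratio=10.0):
    """Consistency check: return readings at least max_ratio times the norm."""
    return [r for r in readings if r >= max_ratio * norm]


# Illustrative values only: four processed amounts and their expected total.
processed = [3000, 3200, 3400, 2800]
print("Control total satisfied:", control_total_ok(processed, 12400))

# Illustrative inspection readings with an assumed norm of 5.0 units.
readings = [5.1, 4.9, 50.3, 5.0]
print("Readings to investigate:", inconsistent_readings(readings, norm=5.0))
```

A failed control total points to missing or duplicated records, while a flagged reading calls for inspection of both the input and the processing mechanism.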

#### Scales of Measurement

For numeric scales, specify units of measurement, increments, the zero point on the measurement scale, and the range of values.

#### Using the Information

Using information involves people. Computers can collect data, manipulate data, and generate information, but the ultimate decision rests with people, and decision-making starts when information becomes available. Intuition, experience, training, interest, and ethics are just a few of the factors that determine how people use information. The same piece of information that is positively used to further the progress of a project in one instance may also be used negatively in another instance. To assure that data and information are used appropriately, computer-based security measures can be built into the information system. Project data may be obtained from several sources. Some potential sources are as follows:

• Formal reports
• Interviews and surveys
• Regular project meetings
• Personnel time cards or work schedules

The timing of data is also very important for project control purposes. The contents, level of detail, and frequency of data can affect the control process. An important aspect of project management is the determination of the data required to generate the information needed for project control. The function of keeping track of the vast quantity of rapidly changing and interrelated data about project attributes can be very complicated. The major steps involved in data analysis for project control are as follows:

• Data collection
• Data analysis and presentation
• Decision-making
• Implementation of action

Data is processed to generate information. Information is analyzed by the decision maker to make the required decisions. Good decisions are based on timely and relevant information, which in turn is based on reliable data. Data analysis for project control may involve the following functions:

• Organizing and printing computer-generated information in a form usable by managers
• Integrating different hardware and software systems to communicate in the same project environment
• Incorporating new technologies such as expert systems into data analysis
• Using graphics and other presentation techniques to convey project information

Proper data management will prevent misuse, misinterpretation, or mishandling. Data is needed at every stage in the life cycle of a project from the problem identification stage through the project phase-out stage. The various items for which data may be needed are project specifications, feasibility study, resource availability, staff size, schedule, project status, performance data, and phase-out plan. The documentation of data requirements should cover the following:

• Data summary. A data summary is a general summary of the information and decision for which the data is required as well as the form in which the data should be prepared. The summary indicates the impact of the data requirements on the organizational goals.
• Data processing environment. The processing environment identifies the project for which the data is required, the user personnel, and the computer system to be used in processing the data. It refers to the project request or authorization and relationship to other projects and specifies the expected data communication needs and mode of transmission.
• Data policies and procedures. Data handling policies and procedures describe policies governing data handling, storage, and modification and the specific procedures for implementing changes to the data. Additionally, they provide instructions for data collection and organization.
• Static data. A static data description describes that portion of the data that is used mainly for reference purposes and it is rarely updated.
• Dynamic data. A dynamic data description describes that portion of the data that is frequently updated based on the prevailing circumstances in the organization.
• Data frequency. The frequency of data update specifies the expected frequency of data change for the dynamic portion of the data, for example, quarterly. This data change frequency should be described in relation to the frequency of processing.
• Data constraints. Data constraints refer to the limitations on the data requirements. Constraints may be procedural (e.g., based on corporate policy), technical (e.g., based on computer limitations), or imposed (e.g., based on project goals).
• Data compatibility. Data compatibility analysis involves ensuring that data collected for project control needs will be compatible with future needs.
• Data contingency. A data contingency plan concerns data security measures in case of accidental or deliberate damage or sabotage affecting hardware, software, or personnel.

#### Data Exploitation

Data availability should be exploited and leveraged for pertinent decision-making. Data exploitation refers to the various mathematical and graphical operations that can be performed on data to elicit the inherent information contained in the data. The manner in which project data is analyzed and presented can affect how the information is perceived by the decision maker. The examples presented in this section illustrate how basic data analysis techniques can be used to convey important information for project control.

In many cases, data is represented as the answer to direct questions such as the following: When is the project deadline? Who are the people assigned to the first task? How many resource units are available? Are enough funds available for the project? What are the quarterly expenditures on the project for the past two years? Is personnel productivity low, average, or high? Who is the person in charge of the project? Answers to these types of questions constitute data of different forms or expressed on different scales. The resulting data may be qualitative or quantitative. Different techniques are available for analyzing the different types of data. This section discusses some of the basic techniques for data analysis. The data presented in Table 3.1 is used to illustrate the data analysis techniques.

### Table 3.1   Quarterly Revenue from Four Projects (in $1,000s)

| Project | Quarter 1 | Quarter 2 | Quarter 3 | Quarter 4 | Row Total |
|---|---|---|---|---|---|
| A | 3,000 | 3,200 | 3,400 | 2,800 | 12,400 |
| B | 1,200 | 1,900 | 2,500 | 2,400 | 8,000 |
| C | 4,500 | 3,400 | 4,600 | 4,200 | 16,700 |
| D | 2,000 | 2,500 | 3,200 | 2,600 | 10,300 |
| Total | 10,700 | 11,000 | 13,700 | 12,000 | 47,400 |

#### Raw Data

Raw data consists of ordinary observations recorded for a decision variable or factor. Examples of factors for which data may be collected for decision-making are revenue, cost, personnel productivity, task duration, project completion time, product quality, and resource availability. Raw data should be organized into a format suitable for visual review and computational analysis. The data in Table 3.1 represents the quarterly revenues from projects A, B, C, and D. For example, the data for quarter 1 indicates that project C yielded the highest revenue of $4,500,000, while project B yielded the lowest revenue of $1,200,000. Figure 3.1 presents the raw data of project revenue as a line graph. The same information is presented as a multiple bar chart in Figure 3.2.

Figure 3.1   Line graph of quarterly project revenues.

Figure 3.2   Multiple bar chart of quarterly project revenues.

#### Total Revenue

A total or sum is a measure that indicates the overall effect of a particular variable. If $X_1, X_2, X_3, \ldots, X_n$ represent a set of $n$ observations (e.g., revenues), then the total is computed as follows:

$T = \sum_{i=1}^{n} X_i$

For the data in Table 3.1, the total revenue for each project is shown in the last column. The totals indicate that project C brought in the largest total revenue over the four quarters under consideration, while project B produced the lowest total revenue. The last row of the table shows the total revenue for each quarter. The totals reveal that the largest revenue occurred in the third quarter. The first quarter brought in the lowest total revenue. The grand total revenue for the four projects over the four quarters is shown as $47,400,000 in the last cell in the table. The total revenues for the four projects over the four quarters are shown in a pie chart in Figure 3.3. The percentage of the overall revenue contributed by each project is also shown on the pie chart.

Figure 3.3   Pie chart of total revenue per project.
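The row totals, column totals, grand total, and the percentage shares that a pie chart would display can be reproduced with a short Python sketch. The dictionary layout and variable names are assumptions made for illustration; the numbers are those of Table 3.1, in $1,000s.

```python
# Table 3.1 revenues in $1,000s: one list of four quarters per project.
revenue = {
    "A": [3000, 3200, 3400, 2800],
    "B": [1200, 1900, 2500, 2400],
    "C": [4500, 3400, 4600, 4200],
    "D": [2000, 2500, 3200, 2600],
}

# Row totals: T = sum of X_i over the four quarters of each project.
project_totals = {p: sum(q) for p, q in revenue.items()}

# Column totals: zip(*rows) regroups the values quarter by quarter.
quarter_totals = [sum(col) for col in zip(*revenue.values())]

grand_total = sum(project_totals.values())

# Percentage of overall revenue contributed by each project.
shares = {p: round(100 * t / grand_total, 1) for p, t in project_totals.items()}

print("Project totals:", project_totals)
print("Quarter totals:", quarter_totals)
print("Grand total:", grand_total)
print("Share of revenue (%):", shares)
```

The same totals could be passed to any charting library to draw the line, bar, and pie charts referenced in the figures.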

#### Average Revenue

Average is one of the most used measures in data analysis. Given $n$ observations (e.g., revenues), $X_1, X_2, X_3, \ldots, X_n$, the average of the observations is computed as

$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{T_x}{n}$

where $T_x$ is the sum of the $n$ revenues. For our sample data, the average quarterly revenues for the four projects are

$\bar{X}_A = \frac{(3,000 + 3,200 + 3,400 + 2,800)(1,000)}{4} = 3,100,000$

$\bar{X}_B = \frac{(1,200 + 1,900 + 2,500 + 2,400)(1,000)}{4} = 2,000,000$

$\bar{X}_C = \frac{(4,500 + 3,400 + 4,600 + 4,200)(1,000)}{4} = 4,175,000$

$\bar{X}_D = \frac{(2,000 + 2,500 + 3,200 + 2,600)(1,000)}{4} = 2,575,000$

Similarly, the expected average revenues per project for the four quarters are

$\bar{X}_1 = \frac{(3,000 + 1,200 + 4,500 + 2,000)(1,000)}{4} = 2,675,000$

$\bar{X}_2 = \frac{(3,200 + 1,900 + 3,400 + 2,500)(1,000)}{4} = 2,750,000$

$\bar{X}_3 = \frac{(3,400 + 2,500 + 4,600 + 3,200)(1,000)}{4} = 3,425,000$

$\bar{X}_4 = \frac{(2,800 + 2,400 + 4,200 + 2,600)(1,000)}{4} = 3,000,000$

The above values are shown in a bar chart in Figure 3.4.

Figure 3.4   Average revenue per project for each quarter.

The average revenue from any of the four projects in any given quarter is calculated as the sum of all the observations divided by the number of observations. That is,

$\bar{\bar{X}} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{M} X_{ij}}{K}$

where

• N is the number of projects.
• M is the number of quarters.
• K is the total number of observations (K  =  NM).

The overall average per project per quarter is

$\bar{\bar{X}} = \frac{47,400,000}{16} = 2,962,500$

As a cross-check, the sum of the quarterly averages should be equal to the sum of the project revenue averages, which is equal to the grand total divided by 4.

$(2,675 + 2,750 + 3,425 + 3,000)(1,000) = (3,100 + 2,000 + 4,175 + 2,575)(1,000) = 11,850,000 = 47,400,000 / 4$

The cross-check procedure above works because we have a balanced table of observations. That is, we have four projects and four quarters. If there were only three projects, for example, the sum of the quarterly averages would not be equal to the sum of the project averages.
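The averages and the cross-check can be verified with a small Python sketch. The helper name `average` and the data layout are illustrative assumptions; values are from Table 3.1, in $1,000s.

```python
# Table 3.1 revenues in $1,000s: one list of four quarters per project.
revenue = {
    "A": [3000, 3200, 3400, 2800],
    "B": [1200, 1900, 2500, 2400],
    "C": [4500, 3400, 4600, 4200],
    "D": [2000, 2500, 3200, 2600],
}


def average(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    values = list(values)
    return sum(values) / len(values)


# Average quarterly revenue for each project (row averages).
project_avgs = {p: average(q) for p, q in revenue.items()}

# Average revenue per project for each quarter (column averages).
quarter_avgs = [average(col) for col in zip(*revenue.values())]

# Overall average over all K = N * M observations.
overall_avg = average(x for q in revenue.values() for x in q)

grand_total = sum(sum(q) for q in revenue.values())

print("Project averages:", project_avgs)
print("Quarter averages:", quarter_avgs)
print("Overall average:", overall_avg)

# Cross-check for the balanced 4 x 4 table: all three values should agree.
print(sum(project_avgs.values()), sum(quarter_avgs), grand_total / 4)
```

Removing one project from `revenue` makes the table unbalanced, and the first two printed sums then differ, as the text notes.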

#### Median Revenue

The median is the value that falls in the middle of a group of observations arranged in order of magnitude. One-half of the observations are above the median, and the other half are below the median. The method of determining the median depends on whether or not the observations are organized into a frequency distribution. For unorganized data, it is necessary to arrange the data in an increasing or decreasing order before finding the median. Given $K$ observations (e.g., revenues), $X_1, X_2, X_3, \ldots, X_K$, arranged in increasing or decreasing order, the median is identified as the value in position $(K + 1)/2$ in the data arrangement if $K$ is an odd number. If $K$ is an even number, then the average of the two middle values is considered to be the median. If the sample data are arranged in increasing order, we would get the following:

1,200, 1,900, 2,000, 2,400, 2,500, 2,500, 2,600, 2,800, 3,000, 3,200, 3,200, 3,400, 3,400, 4,200, 4,500, and 4,600

Because $K = 16$ is even, the median is the average of the eighth and ninth values, (2,800 + 3,000)/2 = 2,900, which corresponds to a revenue of $2,900,000.
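The positional rule for the median can be sketched in Python and compared against the standard library's `statistics.median`. The function name `median_by_position` is an assumption for illustration; the observations are the sixteen revenues of Table 3.1, in $1,000s.

```python
from statistics import median

# The sixteen quarterly revenues from Table 3.1, in $1,000s.
observations = [
    3000, 1200, 4500, 2000,
    3200, 1900, 3400, 2500,
    3400, 2500, 4600, 3200,
    2800, 2400, 4200, 2600,
]


def median_by_position(values):
    """Median by the positional rule: middle value for odd K,
    average of the two middle values for even K."""
    ordered = sorted(values)
    k = len(ordered)
    if k % 2 == 1:
        # Position (K + 1) / 2 in 1-based counting.
        return ordered[(k + 1) // 2 - 1]
    # Positions K / 2 and K / 2 + 1 in 1-based counting.
    return (ordered[k // 2 - 1] + ordered[k // 2]) / 2


med = median_by_position(observations)
print("Median by position:", med)
print("statistics.median: ", median(observations))
```

Both approaches average the eighth and ninth ordered values, 2,800 and 3,000.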

### Table 3.2   Average Deviation, Standard Deviation, and Variance

| Observation Number ($i$) | Recorded Observation $X_i$ | Deviation from Average $X_i - \bar{X}$ | Absolute Value $\lvert X_i - \bar{X} \rvert$ | Square of Deviation $(X_i - \bar{X})^2$ |
|---|---|---|---|---|
| 1 | 3,000 | 37.5 | 37.5 | 1,406.25 |
| 2 | 1,200 | −1,762.5 | 1,762.5 | 3,106,406.30 |
| 3 | 4,500 | 1,537.5 | 1,537.5 | 2,363,906.30 |
| 4 | 2,000 | −962.5 | 962.5 | 926,406.25 |
| 5 | 3,200 | 237.5 | 237.5 | 56,406.25 |
| 6 | 1,900 | −1,062.5 | 1,062.5 | 1,128,906.30 |
| 7 | 3,400 | 437.5 | 437.5 | 191,406.25 |
| 8 | 2,500 | −462.5 | 462.5 | 213,906.25 |
| 9 | 3,400 | 437.5 | 437.5 | 191,406.25 |
| 10 | 2,500 | −462.5 | 462.5 | 213,906.25 |
| 11 | 4,600 | 1,637.5 | 1,637.5 | 2,681,406.30 |
| 12 | 3,200 | 237.5 | 237.5 | 56,406.25 |
| 13 | 2,800 | −162.5 | 162.5 | 26,406.25 |
| 14 | 2,400 | −562.5 | 562.5 | 316,406.25 |
| 15 | 4,200 | 1,237.5 | 1,237.5 | 1,531,406.30 |
| 16 | 2,600 | −362.5 | 362.5 | 131,406.25 |
| Total | 47,400.0 | 0.0 | 11,600.0 | 13,137,500.25 |
| Average | 2,962.5 | 0.0 | 725.0 | 821,093.77 |
| Square root | — | — | — | 906.14 |

The range is determined by the two extreme values in a set of observations. Given $K$ observations (e.g., revenues), $X_1, X_2, X_3, \ldots, X_K$, the range of the observations is simply the difference between the lowest and the highest observations. This measure is useful when the analyst wants to know the extent of extreme variations in a parameter. The range of the revenues in our sample data is ($4,600,000 − $1,200,000) = $3,400,000. Because of its dependence on only two values, the range tends to increase as the sample size increases. Furthermore, it does not provide a measurement of the variability of the observations relative to the center of the distribution. This is why the standard deviation is normally used as a more reliable measure of dispersion than the range.

The variability of a distribution is generally expressed in terms of the deviation of each observed value from the sample average. If the deviations are small, the set of data is said to have low variability. The deviations provide information about the degree of dispersion in a set of observations. A general formula to evaluate the variability of data cannot be based on the sum of the raw deviations. This is because some of the deviations are negative, whereas some are positive, and the sum of all the deviations is equal to 0. One possible solution to this is to compute the average deviation.

#### Average Deviation

The average deviation is the average of the absolute values of the deviations from the sample average. Given $K$ observations (e.g., revenues), $X_1, X_2, X_3, \ldots, X_K$, the average deviation of the data is computed as

$\bar{D} = \frac{\sum_{i=1}^{K} \lvert X_i - \bar{X} \rvert}{K}$

Table 3.2 shows how the average deviation is computed for our sample data. One aspect of the average deviation measure is that the procedure ignores the sign associated with each deviation. Despite this disadvantage, its simplicity and ease of computation make it useful. In addition, the knowledge of the average deviation helps in understanding the standard deviation, which is the most important measure of dispersion available.

#### Sample Variance

Sample variance is the average of the squared deviations computed from a set of observations. If the variance of a set of observations is large, the data is said to have a large variability. For example, a large variability in the levels of productivity of a project team may indicate a lack of consistency or improper methods in the project functions. Given $K$ observations (e.g., revenues), $X_1, X_2, X_3, \ldots, X_K$, the sample variance of the data is computed as

$s^2 = \frac{\sum_{i=1}^{K} (X_i - \bar{X})^2}{K - 1}$

The variance can also be computed by the following alternate formulas:

$s^2 = \frac{\sum_{i=1}^{K} X_i^2 - \left(\frac{1}{K}\right)\left[\sum_{i=1}^{K} X_i\right]^2}{K - 1}$

$s^2 = \frac{\sum_{i=1}^{K} X_i^2 - K\bar{X}^2}{K - 1}$

Using the first formula, the sample variance of the data in Table 3.2 is calculated as

$s^2 = \frac{13,137,500.25}{16 - 1} = 875,833.33$

The average calculated in the last column of Table 3.2 is obtained by dividing the total for that column by 16 instead of 16 − 1 = 15. That average is not the correct value of the sample variance. However, as the number of observations gets very large, the average as computed in the table will become a close estimate for the correct sample variance. Analysts make a distinction between the two values by referring to the number calculated in the table as the population variance when $K$ is very large and referring to the number calculated by the formulas above as the sample variance, particularly when $K$ is small. For our example, the population variance is given by

$\sigma^2 = \frac{\sum_{i=1}^{K} (X_i - \bar{X})^2}{K} = \frac{13,137,500.25}{16} = 821,093.77$

while the sample variance, as shown previously for the same data set, is given by

$s^2 = \frac{\sum_{i=1}^{K} (X_i - \bar{X})^2}{K - 1} = \frac{13,137,500.25}{16 - 1} = 875,833.33$

#### Standard Deviation

The sample standard deviation of a set of observations is the positive square root of the sample variance. The use of variance as a measure of variability has some drawbacks. For example, the knowledge of the variance is helpful only when two or more sets of observations are compared. Because of the squaring operation, the variance is expressed in square units rather than the original units of the raw data. To get a reliable feel for the variability in the data, it is necessary to restore the original units by performing the square root operation on the variance. This is why standard deviation is a widely recognized measure of variability. Given $K$ observations (e.g., revenues), $X_1, X_2, X_3, \ldots, X_K$, the sample standard deviation of the data is computed as

$s = \sqrt{\frac{\sum_{i=1}^{K} (X_i - \bar{X})^2}{K - 1}}$

As in the case of the sample variance, the sample standard deviation can also be computed by the following alternate formulas:

$s = \sqrt{\frac{\sum_{i=1}^{K} X_i^2 - \left(\frac{1}{K}\right)\left[\sum_{i=1}^{K} X_i\right]^2}{K - 1}}$

$s = \sqrt{\frac{\sum_{i=1}^{K} X_i^2 - K(\bar{X})^2}{K - 1}}$

Using the first formula, the sample standard deviation of the data is calculated as

$s = \sqrt{\frac{13,137,500.25}{16 - 1}} = \sqrt{875,833.33} = 935.8597$

We can say that the variability in the expected revenue per project per quarter is $935,859.70. The population standard deviation is given by the following:

$\sigma = \sqrt{\frac{\sum_{i=1}^{K} (X_i - \bar{X})^2}{K}} = \sqrt{\frac{13,137,500.25}{16}} = \sqrt{821,093.77} = 906.1423$

The sample standard deviation is given by the following expression:

$s = \sqrt{\frac{\sum_{i=1}^{K} (X_i - \bar{X})^2}{K - 1}} = \sqrt{\frac{13,137,500.25}{16 - 1}} = 935.8597$
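The dispersion measures discussed in this section (range, average deviation, variance, and standard deviation) can be recomputed with the Python sketch below. Variable names are illustrative assumptions; values are in $1,000s. Exact arithmetic gives 13,137,500 for the sum of squared deviations; the 13,137,500.25 shown in Table 3.2 appears to result from five table entries being rounded up by 0.05 each, and the standard deviations agree to two decimal places either way.

```python
import math

# The sixteen quarterly revenues from Table 3.1, in $1,000s.
observations = [
    3000, 1200, 4500, 2000,
    3200, 1900, 3400, 2500,
    3400, 2500, 4600, 3200,
    2800, 2400, 4200, 2600,
]

k = len(observations)
avg = sum(observations) / k

# Range: difference between the highest and lowest observations.
data_range = max(observations) - min(observations)

# Average deviation: mean of the absolute deviations from the average.
avg_deviation = sum(abs(x - avg) for x in observations) / k

# Sum of squared deviations, shared by both variance formulas.
sum_sq_dev = sum((x - avg) ** 2 for x in observations)

sample_variance = sum_sq_dev / (k - 1)    # divisor K - 1
population_variance = sum_sq_dev / k      # divisor K

sample_sd = math.sqrt(sample_variance)
population_sd = math.sqrt(population_variance)

print("Range:", data_range)
print("Average deviation:", avg_deviation)
print("Sum of squared deviations:", sum_sq_dev)
print("Sample variance:", round(sample_variance, 2))
print("Population variance:", round(population_variance, 2))
print("Sample standard deviation:", round(sample_sd, 4))
print("Population standard deviation:", round(population_sd, 4))
```

The standard library functions `statistics.variance` and `statistics.pvariance` use the same K − 1 and K divisors, respectively, and can serve as an independent check.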

The results of data analysis can be reviewed directly to determine where and when project control actions may be needed. The results can also be used to generate control charts, as illustrated in Chapter 1 for my high school course grades.
