This chapter presents the concepts and methods for data visualization. A project control scenario is used to illustrate data management for measurement of project performance. The data presentation techniques presented in this chapter are translatable to other data analytics platforms.
Data viewed is data appreciated.
Statistical data management is essential for analyzing and interpreting measurement outputs. In this chapter, a project control scenario is used to illustrate data management for measurement of project performance. The present age of computer software, hardware, and tools offers a vast array of techniques for data visualization, beyond what is presented in this chapter. Readers are encouraged to refer to the latest commercial and open-source software for data visualization. More importantly, the prevalence of cloud-based subscription software products can assist with on-demand data visualization needs. Those online tools should be leveraged at the time of need. The chapter presents only basic and standard methods to spark and guide the interest and awareness of readers.
For challenges of interest, such as the COVID-19 pandemic, data visualization can generate an immediate impact of understanding and appreciation, and, consequently, the determination of the lines of action needed. Tracking the fast worldwide spread of coronavirus helped to heighten the necessity and utility of data visualization. In the wake of COVID-19, several online data visualization tools evolved quickly to inform and educate the public about the disease’s spread. One of the earliest such tools was the www.covidvisualizer.com website, which was developed in 2020 by Navid Mamoon and Gabriel Rasskin, two undergraduate students at Carnegie Mellon University. The goal of the project is to provide a simple interactive way to visualize the impact of COVID-19. The developers want people to see the effort as something that brings people together in the collective worldwide fight against COVID-19. The website has a colorful and visually pleasing (almost trance-inducing) rotation of the Earth. Clicking on a country as it rotates by brings up the country’s up-to-the-minute statistics for COVID-19. The information displayed includes the following:
In response to the developers’ solicitation of questions, suggestions, or feedback, I had the pleasure of contacting them to suggest adding a search tool to the website. The original website design gave access to each country’s information only when the country was clicked during the rotation of the globe, and the countries were not labeled with their names. This means that a user had to know which country is which on the world map in order to click on it. Unfortunately, not all users can identify specific countries on the world map. Further, some countries are so tiny that clicking on them on a rotating globe is practically impossible. The idea of a search tool is to improve the user-friendliness of the website by providing a way to search for a specific country of interest. The developers were excited about the feedback and implemented a by-name search tool. Their confirmation, dated March 28, 2020, is reproduced below:
Thank you for requesting the search feature on our website, covidvisualizer.com . We apologize for the delay, (it can take a while to develop a feature like this) but there is now a search function running on the site! You can search by country name or ISO code by simply clicking the new little search icon.
Unfortunately, within 24 hours, the search tool was removed, which prompted me to reengage with the developers. Their response of March 29, 2020, is echoed below:
We unfortunately disabled it, it caused some issues with our server and we'll have to develop it further.
Apparently, adding the search tool caused the website’s server to crash. The developers took the suggestion seriously and developed the visualization tool further. In a subsequent version of the website, the developers included two stable and sustainable search tools, through which a user can search by country name or by scrolling through an alphabetical listing of all countries. The website has enjoyed consistent worldwide usage since it was introduced in early March 2020. I am delighted and proud that, from a user perspective, I was able to provide mentoring and technical feedback to the website developers. The moral of this story is that we are all in the fight against COVID-19 together and that teamwork is essential for success. In addition, user assessment and feedback are essential for product advancement, regardless of whether the product is a commercial product or an open-source tool available free online. Thus, making a contribution to the utility of this very useful website is a proud accomplishment that bears out the theme of this chapter and the entire book.
The data to be visually presented can be dynamic, volatile, and elusive. The more we can know about the characteristics of the data, the better we can design, evaluate, and implement the technical protocol to handle the data. Transient data is defined as a volatile set of data that is used for one-time decision-making and is not then needed again. An example may be the number of operators that show up at a job site on a given day. Unless there is some correlation between the day-to-day attendance records of operators, this piece of information will have relevance only for that given day. The project manager can make his decision for that day on the basis of that day’s attendance record. Transient data need not be stored in a permanent database unless it may be needed for future analysis or uses (e.g., forecasting, incentive programs, performance review).
Recurring data refers to data that is encountered frequently enough to necessitate storage on a permanent basis. An example is a file containing contract due dates. This file will need to be kept at least through the project life cycle. Recurring data may be further categorized into static data and dynamic data. Recurring data that is static retains its original parameters and values each time it is retrieved and used. Recurring data that is dynamic has the potential for taking on different parameters and values each time it is retrieved and used. Storage and retrieval considerations for project control should address the following questions:
It is essential to determine what data to collect for project control purposes. Data collection and analysis are the basic components of generating information for project control. The requirements for data collection are discussed next.
Choosing the data involves selecting data on the basis of their relevance, the likelihood that they will be needed for future decisions, and whether or not they contribute to making the decision better. The intended users of the data should also be identified.
Collecting the data involves identifying a suitable method of collecting the data as well as the source from which the data will be collected. The collection method will depend on the particular operation being addressed. The common methods include manual tabulation, direct keyboard entry, optical character reader, magnetic coding, electronic scanner, and, more recently, voice command. An input control may be used to confirm the accuracy of collected data. Examples of items to control when collecting data are the following:
A relevance check determines whether the data is relevant to the prevailing problem. For example, data collected on personnel productivity may not be relevant for a decision involving marketing strategies.
A limit check ensures that the data is within known or acceptable limits. For example, an employee overtime claim amounting to over 80 hours per week for several weeks in a row is an indication of a record well beyond ordinary limits.
A critical value identifies a boundary point for data values. Values below or above a critical value fall in different data categories. For example, the lower specification limit for a given characteristic of a product is a critical value that determines whether or not the product meets quality requirements.
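The limit and critical-value controls described above can be expressed as simple validation functions. The following is a minimal sketch rather than a prescribed implementation: the function names, the assumed acceptable range of 0 to 80 overtime hours, and the specification limit of 9.5 are illustrative assumptions chosen to mirror the examples in the text.

```python
def within_limits(value, lower, upper):
    """Limit check: True when the value lies within the acceptable range."""
    return lower <= value <= upper


def classify_by_critical_value(value, critical_value):
    """Critical-value check: place a value in a category relative to a boundary."""
    if value >= critical_value:
        return "meets requirement"
    return "fails requirement"


# Hypothetical weekly overtime claims (hours) for one employee.
overtime_claims = [12, 85, 90, 8]

# Flag any claim outside the assumed acceptable range of 0 to 80 hours.
flagged = [hours for hours in overtime_claims if not within_limits(hours, 0, 80)]
print(flagged)

# Hypothetical lower specification limit of 9.5 for a product characteristic.
print(classify_by_critical_value(9.7, 9.5))
print(classify_by_critical_value(9.2, 9.5))
```

A relevance check is harder to automate because it depends on the decision at hand, so it is left out of this sketch.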
Coding the data refers to the technique used in representing data in a form useful for generating information. This should be done in a compact and yet meaningful format. The performance of information systems can be greatly improved if effective data formats and coding are designed into the system right from the beginning.
Data processing is the manipulation of data to generate useful information. Different types of information may be generated from a given data set depending on how it is processed. The processing method should consider how the information will be used, who will be using it, and what system response time is desired. If possible, processing controls should be used.
A control total checks the completeness of the processing by comparing accumulated results to a known total. An example of this is the comparison of machine throughput to a standard production level or the comparison of cumulative project budget depletion to a cost accounting standard.
A consistency check determines whether the processing is producing the same results for similar data. For example, an electronic inspection device that suddenly shows a measurement that is ten times higher than the norm warrants an investigation of both the input and the processing mechanisms.
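The two processing controls above can be sketched in a few lines. This is an illustrative sketch only: the function names, the sample values, and the default threshold of ten times the norm (taken from the inspection-device example) are assumptions, not part of any particular system.

```python
def control_total_matches(processed_values, known_total, tolerance=0.0):
    """Control total: compare the accumulated result to a known total."""
    return abs(sum(processed_values) - known_total) <= tolerance


def inconsistent_readings(readings, norm, max_ratio=10.0):
    """Consistency check: return readings at or above max_ratio times the norm."""
    return [r for r in readings if r >= max_ratio * norm]


# Hypothetical batch of processed cost entries and the total they should add up to.
batch_costs = [120, 135, 110]
print(control_total_matches(batch_costs, known_total=365))

# Hypothetical inspection measurements with an assumed norm of 5.0 units.
measurements = [5.1, 4.9, 50.3, 5.0]
print(inconsistent_readings(measurements, norm=5.0))
```

A flagged reading does not say whether the input or the processing is at fault; as the text notes, both need to be investigated.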
For numeric scales, specify units of measurement, increments, the zero point on the measurement scale, and the range of values.
Using information involves people. Computers can collect data, manipulate data, and generate information, but the ultimate decision rests with people, and decision-making starts when information becomes available. Intuition, experience, training, interest, and ethics are just a few of the factors that determine how people use information. The same piece of information that is positively used to further the progress of a project in one instance may also be used negatively in another instance. To assure that data and information are used appropriately, computer-based security measures can be built into the information system. Project data may be obtained from several sources. Some potential sources are as follows:
The timing of data is also very important for project control purposes. The contents, level of detail, and frequency of data can affect the control process. An important aspect of project management is the determination of the data required to generate the information needed for project control. The function of keeping track of the vast quantity of rapidly changing and interrelated data about project attributes can be very complicated. The major steps involved in data analysis for project control are as follows:
Data is processed to generate information. Information is analyzed by the decision maker to make the required decisions. Good decisions are based on timely and relevant information, which in turn is based on reliable data. Data analysis for project control may involve the following functions:
Proper data management will prevent misuse, misinterpretation, or mishandling. Data is needed at every stage in the life cycle of a project from the problem identification stage through the project phase-out stage. The various items for which data may be needed are project specifications, feasibility study, resource availability, staff size, schedule, project status, performance data, and phase-out plan. The documentation of data requirements should cover the following:
Data availability should be exploited and leveraged for pertinent decision-making. Data exploitation refers to the various mathematical and graphical operations that can be performed on data to elicit the inherent information contained in the data. The manner in which project data is analyzed and presented can affect how the information is perceived by the decision maker. The examples presented in this section illustrate how basic data analysis techniques can be used to convey important information for project control.
In many cases, data is represented as the answer to direct questions such as the following: When is the project deadline? Who are the people assigned to the first task? How many resource units are available? Are enough funds available for the project? What are the quarterly expenditures on the project for the past two years? Is personnel productivity low, average, or high? Who is the person in charge of the project? Answers to these types of questions constitute data of different forms or expressed on different scales. The resulting data may be qualitative or quantitative. Different techniques are available for analyzing the different types of data. This section discusses some of the basic techniques for data analysis. The data presented in Table 3.1 is used to illustrate the data analysis techniques.
Table 3.1 Quarterly revenues from four projects (in $1,000s)

| Project | Quarter 1 | Quarter 2 | Quarter 3 | Quarter 4 | Row Total |
|---|---|---|---|---|---|
| A | 3,000 | 3,200 | 3,400 | 2,800 | 12,400 |
| B | 1,200 | 1,900 | 2,500 | 2,400 | 8,000 |
| C | 4,500 | 3,400 | 4,600 | 4,200 | 16,700 |
| D | 2,000 | 2,500 | 3,200 | 2,600 | 10,300 |
| Total | 10,700 | 11,000 | 13,700 | 12,000 | 47,400 |
Raw data consists of ordinary observations recorded for a decision variable or factor. Examples of factors for which data may be collected for decision-making are revenue, cost, personnel productivity, task duration, project completion time, product quality, and resource availability. Raw data should be organized into a format suitable for visual review and computational analysis. The data in Table 3.1 represents the quarterly revenues from projects A, B, C, and D. For example, the data for quarter 1 indicates that project C yielded the highest revenue of $4,500,000, while project B yielded the lowest revenue of $1,200,000. Figure 3.1 presents the raw data of project revenue as a line graph. The same information is presented as a multiple bar chart in Figure 3.2.
Figure 3.1 Line graph of quarterly project revenues.
Figure 3.2 Multiple bar chart of quarterly project revenues.
A total or sum is a measure that indicates the overall effect of a particular variable. If $X_1, X_2, X_3, \ldots, X_n$ represent a set of $n$ observations (e.g., revenues), then the total is computed as follows:
$$T_x = \sum_{i=1}^{n} X_i$$
For the data in Table 3.1, the total revenue for each project is shown in the last column. The totals indicate that project C brought in the largest total revenue over the four quarters under consideration, while project B produced the lowest total revenue. The last row of the table shows the total revenue for each quarter. The totals reveal that the largest revenue occurred in the third quarter. The first quarter brought in the lowest total revenue. The grand total revenue for the four projects over the four quarters is shown as $47,400,000 in the last cell in the table. The total revenues for the four projects over the four quarters are shown in a pie chart in Figure 3.3. The percentage of the overall revenue contributed by each project is also shown on the pie chart.
Figure 3.3 Pie chart of total revenue per project.
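The totals discussed above, and the percentage shares that a pie chart such as Figure 3.3 displays, can be reproduced from Table 3.1 with a short script. This is a sketch under the assumption that the table is held as a dictionary of per-project lists; the variable names are illustrative.

```python
# Quarterly revenues from Table 3.1, in $1,000s (quarters 1 to 4 per project).
revenue = {
    "A": [3000, 3200, 3400, 2800],
    "B": [1200, 1900, 2500, 2400],
    "C": [4500, 3400, 4600, 4200],
    "D": [2000, 2500, 3200, 2600],
}

# Row totals: total revenue per project over the four quarters.
project_totals = {project: sum(values) for project, values in revenue.items()}

# Column totals: total revenue per quarter over the four projects.
quarter_totals = [sum(column) for column in zip(*revenue.values())]

# Grand total and each project's percentage share of it.
grand_total = sum(project_totals.values())
shares = {p: round(100 * t / grand_total, 1) for p, t in project_totals.items()}

print(project_totals)
print(quarter_totals)
print(grand_total)
print(shares)
```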
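Average is one of the most used measures in data analysis. Given $n$ observations (e.g., revenues), $X_1, X_2, X_3, \ldots, X_n$, the average of the observations is computed as
$$\overline{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{T_x}{n}$$
where $T_x$ is the sum of the $n$ revenues. For our sample data, the average quarterly revenues for the four projects are
$$\begin{aligned}
\overline{X}_A &= \frac{(3{,}000 + 3{,}200 + 3{,}400 + 2{,}800)(\$1{,}000)}{4} = \$3{,}100{,}000 \\
\overline{X}_B &= \frac{(1{,}200 + 1{,}900 + 2{,}500 + 2{,}400)(\$1{,}000)}{4} = \$2{,}000{,}000 \\
\overline{X}_C &= \frac{(4{,}500 + 3{,}400 + 4{,}600 + 4{,}200)(\$1{,}000)}{4} = \$4{,}175{,}000 \\
\overline{X}_D &= \frac{(2{,}000 + 2{,}500 + 3{,}200 + 2{,}600)(\$1{,}000)}{4} = \$2{,}575{,}000
\end{aligned}$$
Similarly, the expected average revenues per project for the four quarters are
$$\begin{aligned}
\overline{X}_1 &= \frac{(3{,}000 + 1{,}200 + 4{,}500 + 2{,}000)(\$1{,}000)}{4} = \$2{,}675{,}000 \\
\overline{X}_2 &= \frac{(3{,}200 + 1{,}900 + 3{,}400 + 2{,}500)(\$1{,}000)}{4} = \$2{,}750{,}000 \\
\overline{X}_3 &= \frac{(3{,}400 + 2{,}500 + 4{,}600 + 3{,}200)(\$1{,}000)}{4} = \$3{,}425{,}000 \\
\overline{X}_4 &= \frac{(2{,}800 + 2{,}400 + 4{,}200 + 2{,}600)(\$1{,}000)}{4} = \$3{,}000{,}000
\end{aligned}$$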
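The above values are shown in a bar chart in Figure 3.4. The average revenue from any of the four projects in any given quarter is calculated as the sum of all the observations divided by the number of observations. That is,
$$\overline{X} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} X_{ij}}{K}$$
where $m$ is the number of projects, $n$ is the number of quarters, and $K = mn$ is the total number of observations. The overall average per project per quarter is
$$\overline{X} = \frac{\$47{,}400{,}000}{16} = \$2{,}962{,}500$$
Figure 3.4 Average revenue per project for each quarter.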
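As a cross-check, the sum of the quarterly averages should be equal to the sum of the project revenue averages, which is equal to the grand total divided by 4:
$$(2{,}675 + 2{,}750 + 3{,}425 + 3{,}000)(\$1{,}000) = (3{,}100 + 2{,}000 + 4{,}175 + 2{,}575)(\$1{,}000) = \frac{\$47{,}400{,}000}{4} = \$11{,}850{,}000$$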
The cross-check procedure above works because we have a balanced table of observations. That is, we have four projects and four quarters. If there were only three projects, for example, the sum of the quarterly averages would not be equal to the sum of the project averages.
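The averages and the cross-check can be verified numerically. The sketch below recomputes the project averages, the quarterly averages, and the overall average from Table 3.1, and then drops project D to show that the cross-check fails for an unbalanced table. The dictionary layout and variable names are illustrative assumptions.

```python
import math

# Quarterly revenues from Table 3.1, in $1,000s.
revenue = {
    "A": [3000, 3200, 3400, 2800],
    "B": [1200, 1900, 2500, 2400],
    "C": [4500, 3400, 4600, 4200],
    "D": [2000, 2500, 3200, 2600],
}


def project_averages(table):
    """Average revenue per quarter for each project (row averages)."""
    return [sum(row) / len(row) for row in table.values()]


def quarter_averages(table):
    """Average revenue per project for each quarter (column averages)."""
    return [sum(col) / len(col) for col in zip(*table.values())]


all_values = [x for row in revenue.values() for x in row]
overall_average = sum(all_values) / len(all_values)

# Balanced table (4 projects x 4 quarters): the two sums agree.
balanced_ok = math.isclose(sum(project_averages(revenue)),
                           sum(quarter_averages(revenue)))

# Unbalanced table (3 projects x 4 quarters): the two sums differ.
three_projects = {p: v for p, v in revenue.items() if p != "D"}
unbalanced_ok = math.isclose(sum(project_averages(three_projects)),
                             sum(quarter_averages(three_projects)))

print(project_averages(revenue), quarter_averages(revenue))
print(overall_average, balanced_ok, unbalanced_ok)
```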
The median is the value that falls in the middle of a group of observations arranged in order of magnitude. One-half of the observations are above the median, and the other half are below the median. The method of determining the median depends on whether or not the observations are organized into a frequency distribution. For unorganized data, it is necessary to arrange the data in an increasing or decreasing order before finding the median. Given $K$ observations (e.g., revenues), $X_1, X_2, X_3, \ldots, X_K$, arranged in increasing or decreasing order, the median is identified as the value in position $(K + 1)/2$ in the data arrangement if $K$ is an odd number. If $K$ is an even number, then the average of the two middle values is considered to be the median. If the sample data are arranged in increasing order, we would get the following:
1,200, 1,900, 2,000, 2,400, 2,500, 2,500, 2,600, 2,800, 3,000, 3,200, 3,200, 3,400, 3,400, 4,200, 4,500, and 4,600
The median is then calculated as (2,800 + 3,000)/2 = 2,900. Half of the recorded revenues are above $2,900,000, while half are below that amount. Figure 3.5 presents a bar chart of the revenue data arranged in increasing order. The median lies between the eighth and ninth values in the ordered data.
Figure 3.5 Ordered bar chart.
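The positional rule for the median can be written out directly and compared with Python's standard library. The helper `median_by_position` is a hypothetical name introduced for illustration; `statistics.median` is the standard-library function.

```python
import statistics

# The 16 revenue observations from Table 3.1, in $1,000s.
revenues = [3000, 1200, 4500, 2000, 3200, 1900, 3400, 2500,
            3400, 2500, 4600, 3200, 2800, 2400, 4200, 2600]


def median_by_position(values):
    """Median via the positional rule: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    k = len(ordered)
    if k % 2 == 1:
        # Odd K: the value in position (K + 1)/2 (converted to a 0-based index).
        return ordered[(k + 1) // 2 - 1]
    # Even K: average of the values in positions K/2 and K/2 + 1.
    upper = k // 2
    return (ordered[upper - 1] + ordered[upper]) / 2


print(median_by_position(revenues))
print(statistics.median(revenues))
```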
The median is a position measure because its value is based on its position in a set of observations. Other measures of position are quartiles and percentiles. There are three quartiles that divide a set of data into four equal categories. The first quartile, denoted $Q_1$, is the value below which one-fourth of all the observations in the data set fall. The second quartile, denoted $Q_2$, is the value below which two-fourths or one-half of all the observations in the data set fall. The third quartile, denoted $Q_3$, is the value below which three-fourths of the observations fall. The second quartile is identical to the median. It is technically incorrect to talk of the fourth quartile because it will imply that there is a point within the data set below which all the data points fall: a contradiction! A data point cannot lie within the range of the observations and at the same time exceed all the observations, including itself.
The concept of percentiles is similar to the concept of quartiles except that reference is made to percentage points. There are 99 percentiles that divide a set of observations into 100 equal parts. The Xth percentile is the value below which X percent of the data fall. The 99th percentile refers to the point below which 99 percent of the observations fall. The three quartiles discussed previously are regarded as the 25th, 50th, and 75th percentiles. It would be technically incorrect to talk of the 100th percentile. For the purpose of doing performance rating, such as on an examination or a product quality assessment, the higher the percentile of an individual or product, the better. In many cases, recorded data are classified into categories that are not indexed to numerical measures. In such cases, other measures of central tendency or position will be needed. An example of such a measure is the mode.
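Quartiles and percentile ranks for the sample revenues can be estimated as sketched below. Several interpolation conventions exist for quartiles; this sketch uses the default ("exclusive") method of `statistics.quantiles` from the Python standard library (Python 3.8 or later), so other tools may report slightly different values for the first and third quartiles. The helper `percentile_rank` is a hypothetical name introduced for illustration.

```python
import statistics

# The 16 revenue observations from Table 3.1, in $1,000s.
revenues = [3000, 1200, 4500, 2000, 3200, 1900, 3400, 2500,
            3400, 2500, 4600, 3200, 2800, 2400, 4200, 2600]

# Three cut points that divide the data into four groups.
q1, q2, q3 = statistics.quantiles(revenues, n=4)
print(q1, q2, q3)


def percentile_rank(data, value):
    """Percentage of observations that fall strictly below the given value."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)


# Share of observations below a revenue of 4,200 (in $1,000s).
print(percentile_rank(revenues, 4200))
```

Under this convention the second quartile coincides with the median, as the text states.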
The mode is defined as the value that has the highest frequency in a set of observations. When the recorded observations can be classified only into categories, the mode can be particularly helpful in describing the data. Given a set of $K$ observations (e.g., revenues), $X_1, X_2, X_3, \ldots, X_K$, the mode is identified as that value that occurs more than any other value in the set. Sometimes, the mode is not unique in a set of observations. For example, in Table 3.2, $2,500, $3,200, and $3,400 all have the same number of occurrences (two each). Each of them is a mode of the set of revenue observations. If there is a unique mode in a set of observations, then the data is said to be unimodal. The mode is very useful in expressing the central tendency for observations with qualitative characteristics such as color, marital status, or state of origin.
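A frequency count makes the multiple modes easy to see. The sketch below uses `collections.Counter` from the standard library; the helper name `modes` and the list of colors are illustrative assumptions.

```python
from collections import Counter

# The 16 revenue observations, in $1,000s.
revenues = [3000, 1200, 4500, 2000, 3200, 1900, 3400, 2500,
            3400, 2500, 4600, 3200, 2800, 2400, 4200, 2600]


def modes(values):
    """Return every value that attains the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes(revenues))

# The same function works for qualitative data (hypothetical color observations).
colors = ["red", "blue", "red", "green"]
print(modes(colors))
```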
Table 3.2 Computation of the average deviation, variance, and standard deviation

| Observation Number ($i$) | Recorded Observation $X_i$ | Deviation from Average $X_i-\overline{X}$ | Absolute Value $\lvert X_i-\overline{X}\rvert$ | Square of Deviation $\left(X_i-\overline{X}\right)^2$ |
|---|---|---|---|---|
| 1 | 3,000 | 37.5 | 37.5 | 1,406.25 |
| 2 | 1,200 | −1,762.5 | 1,762.5 | 3,106,406.30 |
| 3 | 4,500 | 1,537.5 | 1,537.5 | 2,363,906.30 |
| 4 | 2,000 | −962.5 | 962.5 | 926,406.25 |
| 5 | 3,200 | 237.5 | 237.5 | 56,406.25 |
| 6 | 1,900 | −1,062.5 | 1,062.5 | 1,128,906.30 |
| 7 | 3,400 | 437.5 | 437.5 | 191,406.25 |
| 8 | 2,500 | −462.5 | 462.5 | 213,906.25 |
| 9 | 3,400 | 437.5 | 437.5 | 191,406.25 |
| 10 | 2,500 | −462.5 | 462.5 | 213,906.25 |
| 11 | 4,600 | 1,637.5 | 1,637.5 | 2,681,406.30 |
| 12 | 3,200 | 237.5 | 237.5 | 56,406.25 |
| 13 | 2,800 | −162.5 | 162.5 | 26,406.25 |
| 14 | 2,400 | −562.5 | 562.5 | 316,406.25 |
| 15 | 4,200 | 1,237.5 | 1,237.5 | 1,531,406.30 |
| 16 | 2,600 | −362.5 | 362.5 | 131,406.25 |
| Total | 47,400.0 | 0.0 | 11,600.0 | 13,137,500.25 |
| Average | 2,962.5 | 0.0 | 725.0 | 821,093.77 |
| Square root | | | | 906.14 |
The range is determined by the two extreme values in a set of observations. Given $K$ observations (e.g., revenues), $X_1, X_2, X_3, \ldots, X_K$, the range of the observations is simply the difference between the lowest and the highest observations. This measure is useful when the analyst wants to know the extent of extreme variations in a parameter. The range of the revenues in our sample data is ($4,600,000 − $1,200,000) = $3,400,000. Because of its dependence on only two values, the range tends to increase as the sample size increases. Furthermore, it does not provide a measurement of the variability of the observations relative to the center of the distribution. This is why the standard deviation is normally used as a more reliable measure of dispersion than the range.
The variability of a distribution is generally expressed in terms of the deviation of each observed value from the sample average. If the deviations are small, the set of data is said to have low variability. The deviations provide information about the degree of dispersion in a set of observations. However, a general formula to evaluate the variability of data cannot be based on the simple sum of the deviations, because some of the deviations are negative, whereas some are positive, and the sum of all the deviations is equal to 0. One possible solution to this is to compute the average deviation.
The average deviation is the average of the absolute values of the deviations from the sample average. Given $K$ observations (e.g., revenues), $X_1, X_2, X_3, \ldots, X_K$, the average deviation of the data is computed as
$$D = \frac{\sum_{i=1}^{K}\left|X_i-\overline{X}\right|}{K}$$
Table 3.2 shows how the average deviation is computed for our sample data: the absolute deviations total 11,600, giving an average deviation of 11,600/16 = 725 (that is, $725,000). One drawback of the average deviation measure is that the procedure ignores the sign associated with each deviation. Despite this disadvantage, its simplicity and ease of computation make it useful. In addition, the knowledge of the average deviation helps in understanding the standard deviation, which is the most important measure of dispersion available.
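The range, the zero sum of signed deviations, and the average deviation can all be checked against Table 3.2 with a few lines of code. This is a minimal sketch; the variable names are illustrative.

```python
# The 16 revenue observations in the order listed in Table 3.2, in $1,000s.
revenues = [3000, 1200, 4500, 2000, 3200, 1900, 3400, 2500,
            3400, 2500, 4600, 3200, 2800, 2400, 4200, 2600]

mean = sum(revenues) / len(revenues)

# Range: difference between the highest and lowest observations.
data_range = max(revenues) - min(revenues)

# Signed deviations cancel out, which is why their plain sum is uninformative.
sum_of_deviations = sum(x - mean for x in revenues)

# Average deviation: mean of the absolute deviations from the average.
average_deviation = sum(abs(x - mean) for x in revenues) / len(revenues)

print(mean, data_range, sum_of_deviations, average_deviation)
```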
Sample variance is the average of the squared deviations computed from a set of observations. If the variance of a set of observations is large, the data is said to have a large variability. For example, a large variability in the levels of productivity of a project team may indicate a lack of consistency or improper methods in the project functions. Given $K$ observations (e.g., revenues), $X_1, X_2, X_3, \ldots, X_K$, the sample variance of the data is computed as
$$s^2 = \frac{\sum_{i=1}^{K}\left(X_i-\overline{X}\right)^2}{K-1}$$
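The variance can also be computed by the following alternate formulas:
$$s^2 = \frac{\sum_{i=1}^{K} X_i^2 - \frac{1}{K}\left(\sum_{i=1}^{K} X_i\right)^2}{K-1}$$
$$s^2 = \frac{\sum_{i=1}^{K} X_i^2 - K\,\overline{X}^2}{K-1}$$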
Using the first formula, the sample variance of the data in Table 3.2 is calculated as
$$s^2 = \frac{13{,}137{,}500.25}{16-1} = 875{,}833.35$$
The average calculated in the last column of Table 3.2 is obtained by dividing the total for that column by 16 instead of 16 − 1 = 15. That average is not the correct value of the sample variance. However, as the number of observations gets very large, the average as computed in the table will become a close estimate for the correct sample variance. Analysts make a distinction between the two values by referring to the number calculated in the table as the population variance when $K$ is very large and referring to the number calculated by the formulas above as the sample variance, particularly when $K$ is small. For our example, the population variance is given by
$$\sigma^2 = \frac{\sum_{i=1}^{K}\left(X_i-\overline{X}\right)^2}{K} = \frac{13{,}137{,}500.25}{16} = 821{,}093.77$$
while the sample variance, as shown previously for the same data set, is given by
$$s^2 = \frac{\sum_{i=1}^{K}\left(X_i-\overline{X}\right)^2}{K-1} = \frac{13{,}137{,}500.25}{16-1} = 875{,}833.35$$
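The effect of dividing by K versus K − 1 can be confirmed numerically. The sketch below computes both variances from the squared deviations and compares them with `statistics.pvariance` and `statistics.variance` from the Python standard library. Because the code works with unrounded squared deviations, its sum of squares is 13,137,500 exactly, slightly below the 13,137,500.25 obtained from the rounded entries of Table 3.2.

```python
import math
import statistics

# The 16 revenue observations, in $1,000s.
revenues = [3000, 1200, 4500, 2000, 3200, 1900, 3400, 2500,
            3400, 2500, 4600, 3200, 2800, 2400, 4200, 2600]

k = len(revenues)
mean = sum(revenues) / k
sum_of_squares = sum((x - mean) ** 2 for x in revenues)

population_variance = sum_of_squares / k        # divide by K
sample_variance = sum_of_squares / (k - 1)      # divide by K - 1

print(sum_of_squares, population_variance, round(sample_variance, 2))

# Cross-check against the standard-library implementations.
print(math.isclose(population_variance, statistics.pvariance(revenues)))
print(math.isclose(sample_variance, statistics.variance(revenues)))
```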
The sample standard deviation of a set of observations is the positive square root of the sample variance. The use of variance as a measure of variability has some drawbacks. For example, the knowledge of the variance is helpful only when two or more sets of observations are compared. Because of the squaring operation, the variance is expressed in square units rather than the original units of the raw data. To get a reliable feel for the variability in the data, it is necessary to restore the original units by performing the square root operation on the variance. This is why standard deviation is a widely recognized measure of variability. Given $K$ observations (e.g., revenues), $X_1, X_2, X_3, \ldots, X_K$, the sample standard deviation of the data is computed as
$$s = \sqrt{\frac{\sum_{i=1}^{K}\left(X_i-\overline{X}\right)^2}{K-1}}$$
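As in the case of the sample variance, the sample standard deviation can also be computed by the following alternate formulas:
$$s = \sqrt{\frac{\sum_{i=1}^{K} X_i^2 - \frac{1}{K}\left(\sum_{i=1}^{K} X_i\right)^2}{K-1}}$$
$$s = \sqrt{\frac{\sum_{i=1}^{K} X_i^2 - K\,\overline{X}^2}{K-1}}$$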
Using the first formula, the sample standard deviation of the data is calculated as
$$s = \sqrt{\frac{13{,}137{,}500.25}{16-1}} = \sqrt{875{,}833.35} = 935.8597$$
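We can say that the variability in the expected revenue per project per quarter is $935,859.70. The population standard deviation is given by the following:
$$\sigma = \sqrt{\frac{13{,}137{,}500.25}{16}} = \sqrt{821{,}093.77} = 906.1422$$
The sample standard deviation is given by the following expression:
$$s = \sqrt{\frac{13{,}137{,}500.25}{16-1}} = \sqrt{875{,}833.35} = 935.8597$$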
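Both standard deviations can be reproduced with the Python standard library. The sketch below relies on `statistics.stdev` (sample, divisor K − 1) and `statistics.pstdev` (population, divisor K); results are in the table's units of $1,000s.

```python
import statistics

# The 16 revenue observations, in $1,000s.
revenues = [3000, 1200, 4500, 2000, 3200, 1900, 3400, 2500,
            3400, 2500, 4600, 3200, 2800, 2400, 4200, 2600]

sample_sd = statistics.stdev(revenues)        # square root of the sample variance
population_sd = statistics.pstdev(revenues)   # square root of the population variance

print(round(sample_sd, 2), round(population_sd, 2))
```

The population figure is always the smaller of the two, and the gap narrows as the number of observations grows.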
The results of data analysis can be reviewed directly to determine where and when project control actions may be needed. The results can also be used to generate control charts, as illustrated in Chapter 1 for my high school course grades.