The larger the data set being analyzed, the harder it is to focus on its main characteristics. To absorb the information a data set contains, the data must be organized properly. An ordered array or a stem and leaf plot can be used for this. Besides the stem and leaf plot, there are several classic data structures that let you store a large amount of information and quickly retrieve and change it. Which of these structures (the stem and leaf plot or something else) is most effective for a specific task is determined by the set of queries used to modify and retrieve the information, and by their relative frequency.
Below we’ll try to evaluate the effectiveness of several of these classic data structures, including the stem and leaf plot.
The simplest option is to keep all the records in memory (or on disk) one after another.
A new record is added to the end of the array, which takes O(1) time, that is, a small bounded amount of time that doesn’t depend on N, the number of records.
However, search and delete will be slow. The only way to find the desired item in an unordered array is to go through its elements one by one and compare the key of each element with the key we need (in our case, the key is a last name; a key is the field used for the search, and it is usually a unique identifier of the record). If the record is present, the algorithm examines half of the records on average before finding it, so the time cost is about N/2. Often, though, the item being looked for is not in the archive at all, and the algorithm has to go through the entire array, spending time N. For a rough estimate, assume the computer examines 1 million records per second.
If you have 100 million records, the search operation may take 100 seconds, and that's a lot.
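The append and search operations described above can be sketched as follows. This is a minimal illustration, assuming each record is a dictionary with a hypothetical "last_name" field used as the key; the names and sample records are invented for the example.

```python
from typing import Optional


def linear_search(records: list[dict], last_name: str) -> Optional[int]:
    """Return the index of the first record with the given key, or None.

    Every record may have to be examined, so the cost grows linearly with N.
    """
    for index, record in enumerate(records):
        if record["last_name"] == last_name:
            return index
    return None  # the whole array was scanned without a match


# Hypothetical sample records, stored one after another.
records = [{"last_name": "Ivanov"}, {"last_name": "Smith"}, {"last_name": "Tukey"}]

# Adding a record at the end does not depend on how many records exist.
records.append({"last_name": "Pareto"})

found = linear_search(records, "Tukey")
missing = linear_search(records, "Nobody")
```

The unsuccessful lookup is the worst case: all N records are compared before the function can report that the key is absent.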
The same applies to the delete operation. To delete a record, we first have to find it. Then the records that follow it must each be moved one position to the left to close the resulting hole. If the first element of the array is deleted, the second has to move into the place of the first, the third into the place of the second, and so on, for a total of N-1 moves. On average there will be N/2 moves, and they are more time-consuming than reads.
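The shifting step can be made explicit with a short sketch. The function name delete_at and the sample values are assumptions made for illustration; in everyday Python one would simply call list.pop(index), which performs the same shift internally.

```python
def delete_at(array: list, index: int) -> int:
    """Delete array[index] by shifting the tail left; return the number of moves."""
    moves = 0
    for i in range(index, len(array) - 1):
        array[i] = array[i + 1]  # move each following element one position left
        moves += 1
    array.pop()  # drop the last slot, which now holds a duplicate
    return moves


data = [10, 20, 30, 40, 50]
moves_first = delete_at(data, 0)  # deleting the first element costs N-1 moves
```

Deleting the last element needs no moves at all, which is why the average over all positions comes out at roughly N/2.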
What matters for us is that the time for search and delete grows linearly with N, that is, both are O(N). The coefficient in front of N is not so important, so we do not write it (and we can’t actually calculate it without a specific machine and a specific implementation of the algorithm). Next is the stem and leaf plot.
A stem and leaf plot is a representation of data samples measured on an interval scale. It was invented by John Tukey and is often used in exploratory data analysis to show the essential characteristics of a data distribution in a convenient, easy-to-read format.
A stem and leaf plot is similar to a histogram, but it is usually more informative for relatively small data sets (fewer than 100 points). Besides serving as a chart, the plot works as a table that lists the data in order of value, which is useful for many statistical procedures.
Several stem and leaf plots can be used to compare different data sets. With adjacent plots we can compare the values of the same characteristic in paired samples, for example, the heart rates of smokers and non-smokers after exercise.
A stem and leaf plot combines a bar graph with a tabular list. As in a histogram, the length of each row corresponds to the number of observations that fall into a certain interval. In addition, the plot shows the numerical value of each observation. For this purpose, each value is split into two components: a stem, which is the leading digit or group of digits, and a leaf, which is the digit that follows. The stem corresponds to the digit positions that stay the same within the chosen interval, and the leaves correspond to the digit positions that vary within it.
A stem and leaf plot is a tool for organizing data visually and analyzing its distribution. The data in the plot is arranged by leading digits (stems) and trailing digits (leaves). For example, the number 18.9 appears in the plot as the stem 18 with the leaf 9.
Unfortunately, Excel does not build a stem and leaf plot automatically, so the plot has to be constructed manually. As the stem, let’s use the integer part of each temperature, and as the leaves, the decimal part.
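The manual construction can also be sketched in a few lines of code. The sketch below assumes non-negative temperatures recorded to one decimal place; the function names and the sample readings are invented for illustration and are not the data set discussed in this article.

```python
from collections import defaultdict


def stem_and_leaf(values: list[float]) -> dict[int, list[int]]:
    """Group values by integer part (stem), keeping the first decimal (leaf)."""
    plot = defaultdict(list)
    for value in sorted(values):
        tenths = round(value * 10)       # 18.9 -> 189, avoids float truncation
        stem, leaf = divmod(tenths, 10)  # 189 -> stem 18, leaf 9
        plot[stem].append(leaf)
    return dict(plot)


def format_plot(plot: dict[int, list[int]]) -> str:
    """Render one row per stem, including stems that have no leaves."""
    rows = []
    for stem in range(min(plot), max(plot) + 1):
        leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
        rows.append(f"{stem:>3} | {leaves}")
    return "\n".join(rows)


# Hypothetical temperature readings in degrees Celsius.
temperatures = [18.9, 16.2, 17.5, 18.1, 19.4, 16.8, 18.3, 21.0]
plot = stem_and_leaf(temperatures)
print(format_plot(plot))
```

Empty stems are printed on purpose: leaving them out would hide gaps in the distribution, which are part of what the plot is meant to show.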
A stem and leaf plot makes a large array of information visible at a glance. For example, the minimum and maximum temperatures can be read directly off the plot. It is evident that most of the values fall into the range 16 to 20°C, and that the values are distributed roughly normally with a mean of about 18°C. The plot also shows a fairly wide tail toward the high values.
The idea of the ordered array is that if the records are ordered, it is easier to search them. Indeed, if the last names are in alphabetical order, finding the right one is much easier: look at the middle of the list and see whether the last name we need is above or below it. In the appropriate half, we look at the middle again and decide once more whether to go up or down. This method is called binary search (also search by halving, or dichotomy). Each comparison halves the remaining range, so the search takes about log2 N comparisons, that is, O(log N) time; for 100 million records that is about 27 comparisons instead of up to 100 million.
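The halving procedure can be sketched as follows. This is a minimal illustration that assumes the keys are last names stored in a list already sorted in ascending order; the function name and the sample names are invented for the example.

```python
def binary_search(sorted_names: list[str], target: str) -> int:
    """Return the index of target in sorted_names, or -1 if it is absent."""
    low, high = 0, len(sorted_names) - 1
    while low <= high:
        middle = (low + high) // 2
        if sorted_names[middle] == target:
            return middle
        if sorted_names[middle] < target:
            low = middle + 1   # the target can only be in the upper half
        else:
            high = middle - 1  # the target can only be in the lower half
    return -1


names = ["Adams", "Brown", "Jones", "Smith", "Tukey"]
position = binary_search(names, "Smith")
absent = binary_search(names, "Clark")
```

Each pass through the loop discards half of the remaining range, so even an unsuccessful search ends after a small number of comparisons.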
Now let's see how much time we need for operations to add and remove entries.
Removal from an ordered array will be faster, since we find the item to delete faster. But then again, an average of N/2 shift operations is needed to close the resulting hole in the array, so the asymptotic average time of removal stays the same: O(N).
Adding an element to an ordered array is also time-consuming. When adding, we want to preserve the ordering, so we can’t just append the item to the end of the array. We need to find its place in the array and then make room for it by shifting the items that follow one position to the right, starting from the last element. On average, N/2 shifts are needed, which means the asymptotic cost is O(N).
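The two phases of an ordered insert, finding the place and then shifting, can be sketched with Python's standard bisect module. The function name insert_sorted and the sample names are assumptions made for illustration.

```python
import bisect


def insert_sorted(sorted_names: list[str], name: str) -> int:
    """Insert name while keeping the list ordered; return the index used."""
    # Finding the place is a binary search: O(log N).
    position = bisect.bisect_left(sorted_names, name)
    # Making room shifts every later element one position right: O(N).
    sorted_names.insert(position, name)
    return position


names = ["Adams", "Brown", "Smith"]
where = insert_sorted(names, "Jones")
```

The fast search does not help the overall cost, because the shift performed by insert dominates it.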
Thus, by sorting the array we have not gained a significant overall improvement: search has become faster, but the average time to add a new record has grown from O(1) to O(N).
Conclusion: storing data in an ordered array is effective only if the data does not change, that is, when delete and add queries don’t occur.
The idea behind lists is that with arrays we spend a lot of time moving the tail of the array. This can be avoided by using lists.
A list is a data structure for storing a sequence of elements. The items of a list are not necessarily stored next to one another in memory; they can be located anywhere. The sequence is maintained because every element stores the memory location of the next item (and, in some lists, of the previous item as well).
A list is called doubly linked if every element stores the locations of both the next and the previous item. If an element stores only the location of the next item, the list is called singly linked.
To insert an element into the middle of a singly linked list, you need to break one arrow (link) and add two new ones (for a doubly linked list, you need to break two arrows and add four new ones).
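Insertion into a singly linked list can be sketched as follows. The Node class and the helper names are hypothetical, chosen only to make the "break one arrow, add two" step concrete.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: str
    next: Optional["Node"] = None  # the "arrow" to the following element


def insert_after(node: Node, value: str) -> Node:
    """Insert a new element right after node in constant time."""
    new_node = Node(value, next=node.next)  # new arrow: new element -> old successor
    node.next = new_node                    # old arrow replaced: node -> new element
    return new_node


def to_values(head: Optional[Node]) -> list[str]:
    """Walk the list from head and collect the stored values."""
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next
    return values


# Build the list Adams -> Smith, then insert Jones between them.
head = Node("Adams", Node("Smith"))
insert_after(head, "Jones")
result = to_values(head)
```

No other element is touched, so the cost of the insertion does not depend on the length of the list, provided the node to insert after is already at hand.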
We can move back and forth through the items of a doubly linked list, but we can’t jump quickly to the middle of the list. To reach element N/2, you need to step to the next item N/2 times, starting from the first.
Therefore, there is no point in keeping a list ordered, because we will not be able to use binary search on it.
So, adding an item to a list or removing one takes O(1) time once its position is known. Searching for an element, as in an unordered array, takes O(N) time on average.
A Pareto chart is a bar chart whose columns correspond to the different values of some categorical variable. The height of each column represents the frequency of the corresponding value, and the columns are arranged in descending order of frequency. The chart also includes a cumulative percentage line, which shows the combined frequency, expressed as a percentage, of the two, three, etc. most common values of the categorical variable.
A Pareto chart is reasonable and effective, for example, in tasks related to quality issues. Suppose we study a batch of substandard components and classify each unit according to the cause of its defect. In this case, the Pareto chart shows the causes of defects ordered from most to least frequent (the bar graph) and, in addition, the percentage of defects caused by the two, three, four, etc. most common causes.
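The numbers behind a Pareto chart, the sorted frequencies and the cumulative percentage line, can be computed with a short sketch. The function name pareto_table and the defect causes below are invented for illustration and do not come from any real data set.

```python
from collections import Counter


def pareto_table(causes: list[str]) -> list[tuple[str, int, float]]:
    """Return (cause, count, cumulative percentage) rows, most frequent first."""
    counts = Counter(causes).most_common()  # sorted by descending frequency
    total = sum(count for _, count in counts)
    table = []
    running = 0
    for cause, count in counts:
        running += count
        table.append((cause, count, 100.0 * running / total))
    return table


# Hypothetical defect log: one entry per substandard component.
defects = ["scratch"] * 5 + ["crack"] * 3 + ["stain"] * 2
table = pareto_table(defects)
```

The counts give the heights of the bars, and the third column gives the points of the cumulative line: in this invented example the two most common causes together account for 80 percent of the defects.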