 # Descriptive statistics

## Descriptive statistics Topics

### Frequency polygon

A distribution of values of a discrete variate represented graphically by plotting the points $(x_1, f_1)$, $(x_2, f_2)$, ..., $(x_k, f_k)$ and drawing a set of straight line segments connecting adjacent points. It is usually preferable to use a histogram for grouped distributions.

### Class

The word "class" has many specialized meanings in mathematics in which it refers to a group of objects with some common property (e.g., characteristic class or conjugacy class).

In statistics, a class is a grouping of values by which data is binned for computation of a frequency distribution (Kenney and Keeping 1962, p. 14). The range of values of a given class is called a class interval, the boundaries of an interval are called class limits, and the middle of a class interval is called the class mark.

The following table summarizes the classes illustrated in the histogram above for an example data set.

| class interval | class mark | absolute frequency | relative frequency | cumulative absolute frequency | relative cumulative frequency |
|---|---|---|---|---|---|
| 0.00-9.99 | 5 | 1 | 0.01 | 1 | 0.01 |
| 10.00-19.99 | 15 | 3 | 0.03 | 4 | 0.04 |
| 20.00-29.99 | 25 | 8 | 0.08 | 12 | 0.12 |
| 30.00-39.99 | 35 | 18 | 0.18 | 30 | 0.30 |
| 40.00-49.99 | 45 | 24 | 0.24 | 54 | 0.54 |
| 50.00-59.99 | 55 | 22 | 0.22 | 76 | 0.76 |
| 60.00-69.99 | 65 | 15 | 0.15 | 91 | 0.91 |
| 70.00-79.99 | 75 | 8 | 0.08 | 99 | 0.99 |
| 80.00-89.99 | 85 | 0 | 0.00 | 99 | 0.99 |
| 90.00-99.99 | 95 | 1 | 0.01 | 100 | 1.00 |

### Zipf's law

In the English language, the probability of encountering the $r$th most common word is given roughly by $P(r) = 0.1/r$ for $r$ up to 1000 or so. The law breaks down for less frequent words, since the harmonic series diverges. Pierce's (1980, p. 87) statement that $\sum P(r) > 1$ for $r = 8727$ is incorrect. Goetz states the law as follows: The frequency of a word is inversely proportional to its statistical rank $r$ such that

$$P(r) \approx \frac{1}{r \ln(1.78 R)},$$

where $R$ is the number of different words.

### Percentile

The $k$th percentile $P_k$ is that value of $x$, say $x_k$, which corresponds to a cumulative frequency of $Nk/100$, where $N$ is the sample size.

### Outlier

An outlier is an observation that lies outside the overall pattern of a distribution (Moore and McCabe 1999). Usually, the presence of an outlier indicates some sort of problem. This can be a case which does not fit the model under study, or an error in measurement.

Outliers are often easy to spot in histograms. For example, the point on the far left in the above figure is an outlier.

A convenient definition of an outlier is a point which falls more than 1.5 times the interquartile range above the third quartile or below the first quartile.

Outliers can also occur when comparing relationships between two sets of data. Outliers of this type can be easily identified on a scatter diagram.

When performing least squares fitting to data, it is often best to discard outliers before computing the line of best fit. This is particularly true of outliers along the $x$ direction, since these points may greatly influence the result.
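The 1.5 times interquartile range rule can be sketched as follows. The function name `find_outliers` and the sample data are hypothetical, and the quartiles are taken from Python's `statistics.quantiles` with `method="inclusive"`, which is only one of several quartile conventions, so borderline points may be classified differently under another convention.

```python
from statistics import quantiles

def find_outliers(data, k=1.5):
    """Return the points lying more than k * IQR beyond the quartiles."""
    # Assumption: quartiles follow the 'inclusive' convention of statistics.quantiles.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up data for illustration
print(find_outliers(sample))
```

The multiplier `k` is exposed as a parameter only so that other cutoffs can be tried; the value 1.5 is the one stated in the definition.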

### Cumulative frequency

Let the absolute frequencies of occurrence of an event in a number of class intervals be denoted $f_1$, $f_2$, .... The cumulative frequency corresponding to the upper boundary of any class interval $c_i$ in a frequency distribution is the total absolute frequency of all values less than that boundary, denoted

$$F_{<} = \sum_{j \le i} f_j.$$

### Statistical range

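The term "range" has two completely different meanings in statistics.

Given order statistics $Y_1 = \min_j X_j$, $Y_2$, ..., $Y_{N-1}$, $Y_N = \max_j X_j$, the range of the random sample is defined by

$$R = Y_N - Y_1 \tag{1}$$

(Hogg and Craig 1995, p. 152).

For small samples, the range is a good estimator of the population standard deviation (Kenney and Keeping 1962, pp. 213-214).

For a continuous uniform distribution

$$P(x) = \begin{cases} 1/C & \text{for } 0 < x < C \\ 0 & \text{otherwise,} \end{cases} \tag{2}$$

the cumulative distribution of the range is given by

$$D(R) = N \left(\frac{R}{C}\right)^{N-1} - (N-1) \left(\frac{R}{C}\right)^{N}. \tag{3}$$

This is illustrated above for a range of values of $N$, colored from red to violet.

Given two samples from this distribution with sizes $M$ and $N$ and ranges $R_1$ and $R_2$, let $u = R_1/R_2$. Then the probability density of $u$ is

$$D(u) = \frac{M(M-1)N(N-1)}{(M+N)(M+N-1)(M+N-2)} \times \begin{cases} (M+N)\,u^{M-2} - (M+N-2)\,u^{M-1} & \text{for } 0 \le u \le 1 \\ (M+N)\,u^{-N} - (M+N-2)\,u^{-N-1} & \text{for } 1 \le u < \infty. \end{cases} \tag{4}$$

The mean is

$$\mu_u = \frac{(M-1)N}{(M+1)(N-2)} \tag{5}$$

and the mode is

$$\hat{u} = \frac{(M-2)(M+N)}{(M-1)(M+N-2)} \tag{6}$$

(Kenney and Keeping 1962).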

### Midrange

Given order statistics $Y_1 = \min_j X_j$, $Y_2$, ..., $Y_{N-1}$, $Y_N = \max_j X_j$ with sample size $N$, the midrange of the random sample is defined by

$$\mathrm{MR} = \tfrac{1}{2}\,(Y_1 + Y_N)$$

(Hogg and Craig 1995, p. 152).
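As a small illustration, the sample range and midrange can be computed directly from the smallest and largest observations. This sketch assumes the usual definitions (range is the largest minus the smallest order statistic, midrange is their average); the function names and data are hypothetical.

```python
def sample_range(sample):
    """Largest order statistic minus the smallest."""
    return max(sample) - min(sample)

def midrange(sample):
    """Average of the smallest and largest order statistics."""
    return (min(sample) + max(sample)) / 2

data = [2, 9, 4]  # made-up sample
print(sample_range(data), midrange(data))
```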

### Contingency table

A contingency table, sometimes called a two-way frequency table, is a tabular mechanism with at least two rows and two columns used in statistics to present categorical data in terms of frequency counts. More precisely, an $r \times c$ contingency table shows the observed frequency of two variables, the observed frequencies of which are arranged into $r$ rows and $c$ columns. The intersection of a row and a column of a contingency table is called a cell.

| gender | cup | cone | sundae | sandwich | other |
|---|---|---|---|---|---|
| male | 592 | 300 | 204 | 24 | 80 |
| female | 410 | 335 | 180 | 20 | 55 |

For example, the above contingency table has two rows and five columns (not counting header rows/columns) and shows the results of a random sample of 2200 adults classified by two variables, namely gender and favorite way to eat ice cream (Larson and Farber 2014). One benefit of having data presented in a contingency table is that it allows one to more easily perform basic probability calculations, a feat made easier still by augmenting a summary...
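The kind of basic probability calculation mentioned above can be sketched with the cell counts from the ice cream table. The variable names are hypothetical, and the two probabilities computed are only examples of what the marginal totals make easy.

```python
# Cell counts as read from the table above (rows: gender, columns: favorite way).
counts = {
    "male":   {"cup": 592, "cone": 300, "sundae": 204, "sandwich": 24, "other": 80},
    "female": {"cup": 410, "cone": 335, "sundae": 180, "sandwich": 20, "other": 55},
}

# Marginal totals: one per row, one per column, plus the grand total.
row_totals = {gender: sum(row.values()) for gender, row in counts.items()}
col_totals = {}
for row in counts.values():
    for way, count in row.items():
        col_totals[way] = col_totals.get(way, 0) + count
grand_total = sum(row_totals.values())

# Marginal probability that a sampled adult is male.
p_male = row_totals["male"] / grand_total
# Conditional probability of preferring a cone, given that the adult is female.
p_cone_given_female = counts["female"]["cone"] / row_totals["female"]

print(row_totals, grand_total)
print(p_male, p_cone_given_female)
```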

### Lorenz curve

The Lorenz curve is used in economics and ecology to describe inequality in wealth or size. The Lorenz curve is a function of the cumulative proportion of ordered individuals mapped onto the corresponding cumulative proportion of their size. Given a sample of $n$ ordered individuals with $x_i'$ the size of individual $i$ and $x_1' < x_2' < \cdots < x_n'$, then the sample Lorenz curve is the polygon joining the points $(h/n, L_h/L_n)$, where $h = 0$, 1, 2, ..., $n$, $L_0 = 0$, and $L_h = \sum_{i=1}^{h} x_i'$. Alternatively, the Lorenz curve can be expressed as

$$L(y) = \frac{\int_0^y x \, dF(x)}{\mu},$$

where $F(x)$ is the cumulative distribution function of ordered individuals and $\mu$ is the average size.

If all individuals are the same size, the Lorenz curve is a straight diagonal line, called the line of equality. If there is any inequality in size, then the Lorenz curve falls below the line of equality. The total amount of inequality can be summarized by the Gini coefficient (also called the Gini ratio), which is the ratio between the area enclosed by the line of equality and the Lorenz curve, and the total triangular area under the line of equality.
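The vertices of the sample Lorenz polygon can be sketched as below. This assumes the construction in which the $h$th vertex is $(h/n, L_h/L_n)$ with $L_h$ the sum of the $h$ smallest sizes and $L_0 = 0$; the function name `lorenz_points` and the data are hypothetical.

```python
from itertools import accumulate

def lorenz_points(sizes):
    """Vertices (h/n, L_h/L_n) of the sample Lorenz polygon, h = 0..n."""
    xs = sorted(sizes)                       # order individuals by increasing size
    n = len(xs)
    partial = [0] + list(accumulate(xs))     # L_0 = 0, L_h = x_1 + ... + x_h
    total = partial[-1]                      # L_n, assumed to be positive
    return [(h / n, partial[h] / total) for h in range(n + 1)]

print(lorenz_points([1, 1, 1, 1]))  # equal sizes: points lie on the line of equality
print(lorenz_points([3, 1]))        # unequal sizes: interior point falls below it
```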

### Lorenz asymmetry coefficient

The Lorenz asymmetry coefficient is a summary statistic of the Lorenz curve that measures the degree of asymmetry of a Lorenz curve. The Lorenz asymmetry coefficient is defined as

$$S = F(\mu) + L(\mu), \tag{1}$$

where the functions $F$ and $L$ are defined as for the Lorenz curve. If $S > 1$, then the point where the Lorenz curve is parallel with the line of equality is above the axis of symmetry. Correspondingly, if $S < 1$, then the point where the Lorenz curve is parallel to the line of equality is below the axis of symmetry.

The sample statistic $S$ can be calculated from ordered size data using the following equations

$$\delta = \frac{\mu - x_m'}{x_{m+1}' - x_m'} \tag{2}$$

$$F(\mu) = \frac{m + \delta}{n} \tag{3}$$

$$L(\mu) = \frac{L_m + \delta\, x_{m+1}'}{L_n}, \tag{4}$$

where $m$ is the number of individuals with a size less than $\mu$, $x_i'$ is the $i$th smallest size among the $n$ individuals, and $L_m = \sum_{i=1}^{m} x_i'$.

### Interquartile range

Divide a set of data into two groups (high and low) of equal size at the statistical median if there is an even number of data points, or two groups consisting of points on either side of the statistical median itself plus the statistical median if there is an odd number of data points. Find the statistical medians of the low and high groups, denoting these first and third quartiles by $Q_1$ and $Q_3$. The interquartile range is then defined by

$$\mathrm{IQR} = Q_3 - Q_1.$$
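The splitting procedure described above can be sketched as follows. The function names are hypothetical; the only assumption beyond the text is that the data are sorted first so that "low" and "high" refer to rank.

```python
from statistics import median

def split_quartiles(data):
    """First and third quartiles as medians of the low and high groups."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    if n % 2 == 0:
        low, high = xs[:half], xs[half:]         # two groups of equal size
    else:
        low, high = xs[:half + 1], xs[half:]     # the median belongs to both groups
    return median(low), median(high)

def interquartile_range(data):
    q1, q3 = split_quartiles(data)
    return q3 - q1

print(split_quartiles(range(1, 10)), interquartile_range(range(1, 10)))
print(split_quartiles(range(1, 9)), interquartile_range(range(1, 9)))
```

Other quartile conventions exist and can give slightly different values for the same data, so results from this sketch need not match those of a particular statistics package.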

### Running maximum

Given a sequence of values $\{a_k\}_{k=1}^{n}$, the running maxima are the sequence of values $\{\max(a_1, \ldots, a_k)\}_{k=1}^{n}$. So, for example, given a sequence $(3, 5, 7, 8, 8, 5, 7, 9, 2, 5)$, the running maxima are $(3, 5, 7, 8, 8, 8, 8, 9, 9, 9)$. The unique values of the running maximum are sometimes known as high-water marks, so the high-water marks for the above sequence are $(3, 5, 7, 8, 9)$, which occur at $k = 1$, 2, 3, 4, and 8.
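A minimal sketch of both notions, using `itertools.accumulate` for the running maxima and a single pass for the high-water marks. The function names and the input sequence are illustrative choices; positions are reported 1-based.

```python
from itertools import accumulate

def running_maxima(values):
    """Maximum of the first k values, for k = 1, 2, ..., n."""
    return list(accumulate(values, max))

def high_water_marks(values):
    """(position, value) pairs where the running maximum strictly increases."""
    marks = []
    best = None
    for k, v in enumerate(values, start=1):
        if best is None or v > best:
            best = v
            marks.append((k, v))
    return marks

seq = [3, 5, 7, 8, 8, 5, 7, 9, 2, 5]  # illustrative sequence
print(running_maxima(seq))
print(high_water_marks(seq))
```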

### Reversion to the mean

Reversion to the mean, also called regression to the mean, is the statistical phenomenon stating that the greater the deviation of a random variate from its mean, the greater the probability that the next measured variate will deviate less far. In other words, an extreme event is likely to be followed by a less extreme event.

Although this phenomenon appears to violate the definition of independent events, it simply reflects the fact that the probability density function $P(x)$ of any random variable $x$, by definition, is nonnegative over every interval and integrates to one over the interval $(-\infty, \infty)$. Thus, as you move away from the mean, the proportion of the distribution that lies closer to the mean than you do increases continuously. Formally,

$$\int_{\mu - \delta_2}^{\mu + \delta_2} P(x)\,dx \ge \int_{\mu - \delta_1}^{\mu + \delta_1} P(x)\,dx$$

for $\delta_2 > \delta_1 > 0$.

The Season 1 episode "Sniper Zero" (2005) of the television crime drama NUMB3RS mentions regression to the mean.

### Bowley skewness

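The Bowley skewness, also known as the quartile skewness coefficient, is defined by

$$\frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_1 - 2 Q_2 + Q_3}{Q_3 - Q_1},$$

where $Q_1$, $Q_2$, and $Q_3$ denote the first, second, and third quartiles. It is implemented in the Wolfram Language as QuartileSkewness[data].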
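As a sketch, the coefficient can be evaluated directly from three quartile values, assuming the usual form $(Q_1 - 2Q_2 + Q_3)/(Q_3 - Q_1)$. The function name is hypothetical, and how the quartiles themselves are obtained is left to the caller because quartile conventions differ; results need not match QuartileSkewness for a given data set.

```python
def bowley_skewness(q1, q2, q3):
    """Quartile skewness from the first, second, and third quartiles."""
    if q3 == q1:
        raise ValueError("undefined when the interquartile range is zero")
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

print(bowley_skewness(1, 3, 5))  # median centered between the quartiles
print(bowley_skewness(1, 2, 5))  # median closer to the first quartile
```

The value is zero when the median sits midway between the quartiles, positive when the upper half of the box is longer, and always lies between -1 and 1.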

### Hinge

The upper and lower hinges are descriptive statistics of a set of $N$ data values, where $N$ is of the form $N = 4n + 5$ with $n = 0$, 1, 2, .... The hinges are obtained by ordering the data in increasing order $a_1$, ..., $a_N$, and writing them out in the shape of a "w" as illustrated above. The values at the bottom legs are called the hinges $H_1$ and $H_2$ (and the central peak is the statistical median). In this ordering,

$$H_1 = a_{n+2} = a_{(N+3)/4} \tag{1}$$

$$M = a_{2n+3} = a_{(N+1)/2} \tag{2}$$

$$H_2 = a_{3n+4} = a_{(3N+1)/4}. \tag{3}$$

For $N$ of the form $4n + 5$, the hinges $H_1$ and $H_2$ are identical to the quartiles $Q_1$ and $Q_3$. The difference $H_2 - H_1$ is called the H-spread.
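A minimal sketch of the hinge positions, assuming the lower hinge, median, and upper hinge sit at the 1-based ranks $(N+3)/4$, $(N+1)/2$, and $(3N+1)/4$ of the sorted data when $N = 4n + 5$. The function name is hypothetical.

```python
def hinges(data):
    """Return (lower hinge, median, upper hinge) for N = 4n + 5 data values."""
    xs = sorted(data)
    count = len(xs)
    if count < 5 or (count - 5) % 4 != 0:
        raise ValueError("number of values must be of the form 4n + 5")
    lower = xs[(count + 3) // 4 - 1]       # rank (N + 3) / 4, converted to 0-based
    middle = xs[(count + 1) // 2 - 1]      # rank (N + 1) / 2
    upper = xs[(3 * count + 1) // 4 - 1]   # rank (3N + 1) / 4
    return lower, middle, upper

h1, m, h2 = hinges(range(1, 10))
print(h1, m, h2, "H-spread:", h2 - h1)
```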

### Gini coefficient

The Gini coefficient $G$ (or Gini ratio) is a summary statistic of the Lorenz curve and a measure of inequality in a population. The Gini coefficient is most easily calculated from unordered size data as the "relative mean difference," i.e., the mean of the difference between every possible pair of individuals, divided by the mean size $\mu$,

$$G = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j|}{2 n^2 \mu}$$

(Dixon et al. 1987, Damgaard and Weiner 2000). Alternatively, if the data is ordered by increasing size of individuals, $G$ is given by

$$G = \frac{\sum_{i=1}^{n} (2i - n - 1)\, x_i'}{n^2 \mu}$$

(Dixon et al. 1988, Damgaard and Weiner 2000), correcting the typographical error in the denominator given in the original paper (Dixon et al. 1987).

The Gini coefficient ranges from a minimum value of zero, when all individuals are equal, to a theoretical maximum of one in an infinite population in which every individual except one has a size of zero. It has been shown that the sample Gini coefficients defined above need to be multiplied by $n/(n-1)$ in order to become unbiased estimators.
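The two calculations can be compared in a short sketch. It assumes the pairwise form $\sum_i \sum_j |x_i - x_j| / (2 n^2 \mu)$ and the ordered form $\sum_i (2i - n - 1)\, x_i' / (n^2 \mu)$; the function names and data are hypothetical, and the mean size is assumed to be positive.

```python
def gini_pairwise(sizes):
    """Gini coefficient from all pairwise absolute differences (O(n^2))."""
    n = len(sizes)
    mean = sum(sizes) / n
    total_diff = sum(abs(a - b) for a in sizes for b in sizes)
    return total_diff / (2 * n * n * mean)

def gini_ordered(sizes):
    """Gini coefficient from data sorted by increasing size (O(n log n))."""
    xs = sorted(sizes)
    n = len(xs)
    mean = sum(xs) / n
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * n * mean)

sizes = [0, 0, 0, 4]  # made-up population: one individual holds everything
g = gini_ordered(sizes)
print(gini_pairwise(sizes), g)
print(g * len(sizes) / (len(sizes) - 1))  # with the n / (n - 1) correction
```

The ordered form avoids the double sum, which matters for large samples; both forms should agree up to floating-point rounding.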

### Benford's law

A phenomenological law also called the first digit law, first digit phenomenon, or leading digit phenomenon. Benford's law states that in listings, tables of statistics, etc., the digit 1 tends to occur with probability $\sim 30\%$, much greater than the expected 11.1% (i.e., one digit out of 9). Benford's law can be observed, for instance, by examining tables of logarithms and noting that the first pages are much more worn and smudged than later pages (Newcomb 1881). While Benford's law unquestionably applies to many situations in the real world, a satisfactory explanation has been given only recently through the work of Hill (1998).

Benford's law was used by the character Charlie Eppes as an analogy to help solve a series of high burglaries in the Season 2 "The Running Man" episode (2006) of the television crime drama NUMB3RS.

Benford's law applies to data that are not dimensionless, so the numerical values of the data depend on the units. If there...
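The roughly 30% figure for a leading 1 can be checked with a short sketch. It assumes the standard logarithmic form of the law, in which a leading digit $d$ occurs with probability $\log_{10}(1 + 1/d)$; that formula is not stated in the text above, and the function name is hypothetical.

```python
from math import log10

def benford_probability(d):
    """Probability that the leading digit is d (1..9), assuming log10(1 + 1/d)."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return log10(1 + 1 / d)

probabilities = {d: benford_probability(d) for d in range(1, 10)}
for d, p in probabilities.items():
    print(d, round(p, 3))
```

Under this assumption the nine probabilities sum to one, since the product of the terms $1 + 1/d$ for $d = 1, \ldots, 9$ telescopes to 10.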

### Quartile variation coefficient

$$V = 100\, \frac{Q_3 - Q_1}{Q_3 + Q_1},$$

where $Q_1$ and $Q_3$ are the first and third quartiles and $Q_3 - Q_1$ is the interquartile range.

### Gauss's inequality

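If a distribution has a single mode at $\mu_0$, then

$$P(|x - \mu_0| \ge \lambda \tau) \le \frac{4}{9 \lambda^2},$$

where

$$\tau^2 \equiv \sigma^2 + (\mu - \mu_0)^2.$$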

### Quartile deviation

$$\mathrm{QD} = \frac{Q_3 - Q_1}{2},$$

where $Q_1$ and $Q_3$ are the first and third quartiles and $Q_3 - Q_1$ is the interquartile range.

### Quartile

One of the four divisions of observations which have been grouped into four equal-sized sets based on their statistical rank. The quartile including the top statistically ranked members is called the first quartile and denoted $Q_1$. The other quartiles are similarly denoted $Q_2$, $Q_3$, and $Q_4$. For $N$ data points with $N$ of the form $4n + 5$ (for $n = 0$, 1, ...), the hinges are identical to the first and third quartiles.

A number of common methods exist for computing the position of the first and third quartiles from the sample size, each with separate rules for odd and even sample sizes and some making use of the nearest integer function (P. Stikker, pers. comm., Jan. 24, 2005). They include the methods of:

- Minitab
- Tukey (Hoaglin et al. 1983)
- Moore and McCabe (2002)
- Mendenhall and Sincich (1995)
- Freund and Perles (1987)

### Quantile

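The word quantile has no fewer than two distinct meanings in probability. Specific elements $x$ in the range of a variate $X$ are called quantiles, and denoted $x$ (Evans et al. 2000, p. 5). This particular meaning has close ties to the so-called quantile function, a function $Q$ which assigns to each probability $p$ attained by a certain probability density function $f = f(X)$ a value $Q_f(p)$ defined by

$$Q_f(p) = \{x : \Pr(X \le x) = p\}. \tag{1}$$

The $k$th $n$-tile $P_k$ is that value of $x$, say $x_k$, which corresponds to a cumulative frequency of $Nk/n$ (Kenney and Keeping 1962). If $n = 4$, the quantity is called a quartile, and if $n = 100$, it is called a percentile.

A parametrized version of quantile is implemented in the Wolfram Language as Quantile[list, q, {{a, b}, {c, d}}], which returns

$$x_{\lfloor r \rfloor} + \left(x_{\lceil r \rceil} - x_{\lfloor r \rfloor}\right) \left(c + d\, \operatorname{frac}(r)\right), \tag{2}$$

where $x_i$ is the $i$th order statistic, $\lfloor r \rfloor$ is the floor function, $\lceil r \rceil$ is the ceiling function, $\operatorname{frac}(r)$ is the fractional part, and

$$r = a + (N + b)\, q. \tag{3}$$

There are a number of slightly different definitions of the quantile that are in common use, each corresponding to a choice of the parameters $a$, $b$, $c$, and $d$ and a plotting position. For example, definition Q1 uses $(a, b, c, d) = (0, 0, 1, 0)$ and is described as the inverted empirical...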

### Frequency distribution

The tabulation of raw data obtained by dividing it into classes of some size and computing the number of data elements (or their fraction out of the total) falling within each pair of class boundaries. The following table shows the frequency distribution of the data set illustrated by the histogram below.

| class interval | class mark | absolute frequency | relative frequency | cumulative absolute frequency | relative cumulative frequency |
|---|---|---|---|---|---|
| 0.00-9.99 | 5 | 1 | 0.01 | 1 | 0.01 |
| 10.00-19.99 | 15 | 3 | 0.03 | 4 | 0.04 |
| 20.00-29.99 | 25 | 8 | 0.08 | 12 | 0.12 |
| 30.00-39.99 | 35 | 18 | 0.18 | 30 | 0.30 |
| 40.00-49.99 | 45 | 24 | 0.24 | 54 | 0.54 |
| 50.00-59.99 | 55 | 22 | 0.22 | 76 | 0.76 |
| 60.00-69.99 | 65 | 15 | 0.15 | 91 | 0.91 |
| 70.00-79.99 | 75 | 8 | 0.08 | 99 | 0.99 |
| 80.00-89.99 | 85 | 0 | 0.00 | 99 | 0.99 |
| 90.00-99.99 | 95 | 1 | 0.01 | 100 | 1.00 |
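The derived columns of such a table follow mechanically from the absolute frequencies, as the sketch below shows. The absolute frequencies are those listed in the table; the function name `frequency_table` and the assumption of equal-width classes starting at zero, with the class mark at the midpoint, are illustrative.

```python
from itertools import accumulate

def frequency_table(absolute, width=10.0):
    """Rows of (class mark, absolute, relative, cumulative, relative cumulative)."""
    total = sum(absolute)
    cumulative = list(accumulate(absolute))
    rows = []
    for k, (freq, cum) in enumerate(zip(absolute, cumulative)):
        mark = width * k + width / 2   # midpoint of the k-th class, classes start at 0
        rows.append((mark, freq, freq / total, cum, cum / total))
    return rows

absolute = [1, 3, 8, 18, 24, 22, 15, 8, 0, 1]  # absolute frequencies from the table
for row in frequency_table(absolute):
    print("%5.1f %3d %.2f %4d %.2f" % row)
```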