Week 5: Descriptive Statistics

KIN 610 - Spring 2023

Dr. Ovande Furtado Jr

Credits

(furtadoDescriptiveStatistics2023?)

Descriptive Stats

Measures of central tendency
Measures of variability
Tables and graphs

Symbols

Measure	Symbol
Mean (population)	\(\mu\)
Mean (sample)	\(\bar{x}\)
Median	\(med\)
Mode	\(mode\)
Range	\(R\)
Interquartile Range	\(IQR\)
Variance (population)	\(\sigma^2\)
Variance (sample)	\(s^2\)
Standard Deviation (population)	\(\sigma\)
Standard Deviation (sample)	\(s\)

Measures of Central Tendency

The Mean

The mean is the most commonly used measure of central tendency in Kinesiology research.
It is the arithmetic average of a data set.
Calculated by adding up all the values and dividing by the total number of values.

Why Mean?

The mean considers all the values in a data set.
It is sensitive to small changes in the data.
It is appropriate for continuous, approximately normally distributed data.

Limitations of Mean

Not always the most appropriate measure of central tendency.
Extreme outliers or skewed data can influence the mean.

# Seed for reproducibility
set.seed(123)

# Generate a right-skewed, gamma-distributed random variable
x <- rgamma(n = 1000, shape = 2, rate = 1)

# Take the square root; this softens the skew, but y stays mildly right-skewed
y <- sqrt(x)

# Plot the histogram of the skewed distribution
hist(y, breaks = 20, col = "lightblue", main = "Skewed Distribution")

Median as an alternative

Use the median when data are continuous but deviate from normality.
The median is not as sensitive to extreme values as the mean.
Appropriate with skewed data or data with outliers.

set.seed(123)  # for reproducibility

# Generate 1000 samples from a chi-squared distribution with 2 degrees of freedom
x <- rchisq(1000, df = 2)

# Plot the histogram of x
hist(x, breaks = 20, col = "skyblue", main = "Badly skewed distribution")
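To put numbers on the histogram above, the sketch below regenerates the same chi-squared sample and compares its mean and median. The variable names `mean_x` and `median_x` are illustrative choices, not part of the original slides. In a right-skewed sample the long tail pulls the mean above the median.

```r
set.seed(123)  # same seed as the histogram example

# Same right-skewed sample: chi-squared with 2 degrees of freedom
x <- rchisq(1000, df = 2)

# Compare the two measures of central tendency
mean_x <- mean(x)
median_x <- median(x)

cat("Mean:  ", round(mean_x, 2), "\n")
cat("Median:", round(median_x, 2), "\n")
cat("Mean exceeds median:", mean_x > median_x, "\n")
```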

Calculating the Mean - Steps

Add up all the values in the dataset.
Count the number of observations in the dataset.
Divide the total sum by the number of observations.

Calculating the Mean - Example

Data set (times in seconds): 12.5, 10.8, 11.2, 13.1, 12.9, 11.7, 12.3

Step 1: 12.5 + 10.8 + 11.2 + 13.1 + 12.9 + 11.7 + 12.3 = 84.5
Step 2: n = 7
Step 3: 84.5 / 7 ≈ 12.07 seconds
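The three steps can be reproduced in R. This is a minimal sketch using the example data; the names `times`, `total`, `n`, and `mean_manual` are chosen here for illustration.

```r
# Example data set from the slide
times <- c(12.5, 10.8, 11.2, 13.1, 12.9, 11.7, 12.3)

total <- sum(times)        # Step 1: add up all the values
n <- length(times)         # Step 2: count the observations
mean_manual <- total / n   # Step 3: divide the sum by the count

round(mean_manual, 2)      # 12.07
mean(times)                # the built-in function gives the same result
```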

Calculating the Mean - Equation

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i \]

where \(n\) is the total number of values in the set, \(x_i\) is the \(i\)th value in the set, and \(\sum_{i=1}^{n}\) represents the sum of all the values from \(i=1\) to \(i=n\).

Displaying data

library(ggplot2)

# Create a data frame with 3 group means
group_means <- data.frame(group = c("Group 1", "Group 2", "Group 3"),
                          mean = c(10, 15, 12))

# Create a bar chart with custom colors and labels
ggplot(group_means, aes(x = group, y = mean, fill = group)) +
  geom_col() +
  scale_fill_manual(values = c("#FFA07A", "#87CEEB", "#90EE90")) +  # custom colors
  labs(title = "Comparison of Group Means in Kinesiology",
       x = "Groups", y = "Mean Values")

Median

Definition of Median
The median is the middle value of an ordered data set
The median is not influenced by extreme values
The median may not be as sensitive to small changes in the data as the mean
In both cases below, the median is 7
data1: 1, 3, 7, 7, 8, 99
data2: 1, 3, 7, 7, 8, 9
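A quick check in R, using the two data sets listed above, shows that the single extreme value (99) leaves the median unchanged but shifts the mean considerably. This is a sketch for illustration only.

```r
data1 <- c(1, 3, 7, 7, 8, 99)  # contains an extreme value
data2 <- c(1, 3, 7, 7, 8, 9)   # no extreme value

median(data1)  # 7
median(data2)  # 7

mean(data1)    # about 20.83, pulled upward by 99
mean(data2)    # about 5.83
```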

Median Calculation

First, arrange the data in order of magnitude
The median is the middle value if the number of observations is odd
The median is the average of the two middle values if the number of observations is even
Use the median when the data contain extreme values or are ordinal
It helps to describe the group’s typical value or performance
Example in the next slide
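The odd and even cases can be sketched with a small hand-written function. `median_by_hand` is a hypothetical helper written for this illustration, and `even_set` is made-up data. In practice the built-in `median()` does the same job.

```r
# Hypothetical helper that follows the steps on the slide
median_by_hand <- function(values) {
  sorted <- sort(values)            # arrange in order of magnitude
  n <- length(sorted)
  if (n %% 2 == 1) {
    sorted[(n + 1) / 2]             # odd n: the middle value
  } else {
    mean(sorted[c(n / 2, n / 2 + 1)])  # even n: average of the two middle values
  }
}

odd_set <- c(12.5, 10.8, 11.2, 13.1, 12.9, 11.7, 12.3)  # n = 7
even_set <- c(4, 8, 6, 10)                              # n = 4, made-up values

median_by_hand(odd_set)   # 12.3
median_by_hand(even_set)  # 7
```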

Mode

Definition of Mode
Mode is the value that occurs most frequently in a data set
Mode is often used with categorical or nominal data

Mode Calculation

Identify the value or category that occurs most frequently in the data set
Helps researchers identify the most common value or category in a data set
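Base R has no built-in function for the statistical mode (`mode()` reports an object's storage type instead), so one possible approach is to count the values with `table()`. The helper `stat_mode` and the `activities` data below are hypothetical, written only for this sketch.

```r
# Hypothetical helper: return the most frequent value(s) as character strings
stat_mode <- function(values) {
  counts <- table(values)                               # frequency of each value
  names(counts)[as.vector(counts) == max(counts)]       # keep the value(s) with the top count
}

# Made-up nominal data: preferred activity of six participants
activities <- c("run", "swim", "run", "cycle", "run", "swim")

stat_mode(activities)  # "run"
```

If two or more values tie for the highest count, this helper returns all of them.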

Comparison table

Measure of Central Tendency	Definition	Calculation	Usefulness
Mean	The average of a set of numbers	Sum of values divided by number of values	Useful for data that are normally distributed and have no extreme values
Median	The middle value in a set of numbers	Order values and find the middle value	Useful for data with extreme values or that are not normally distributed
Mode	The value that occurs most frequently in a set of numbers	Identify the value that appears most often	Useful for categorical or nominal data

Measures of Variability

Introduction

Measures of variability describe how much individual data points in a data set differ from one another
Types: range, interquartile range, variance, and standard deviation
Importance
- understanding data sets
- judging the precision and consistency of results
- drawing meaningful conclusions

Range

The range is the difference between the largest and smallest values in a data set
It depends on only two values, so it is highly sensitive to outliers and extreme values
Interpret the range with caution, especially when outliers are present
Use the range together with other measures of variability

Calculating Range

Use jamovi to obtain the minimum and maximum values of a data set
Calculate the range by subtracting the minimum from the maximum value
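The same calculation can be sketched in R for readers who prefer code. Note that R's `range()` returns the minimum and maximum, not their difference. The data are the two sets from the median slide.

```r
data1 <- c(1, 3, 7, 7, 8, 99)  # contains an extreme value
data2 <- c(1, 3, 7, 7, 8, 9)

min_max <- range(data1)        # minimum and maximum: 1 and 99
range1 <- diff(min_max)        # maximum minus minimum: 98
range2 <- max(data2) - min(data2)  # 8

range1
range2
```

A single extreme value changes the range from 8 to 98, which illustrates its sensitivity to outliers.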

Interquartile Range

The interquartile range (IQR) is a measure of variability that is less sensitive to outliers than the range
Quartiles divide an ordered data set into four equal parts
The IQR is the difference between the upper (Q3) and lower (Q1) quartiles
It describes the spread of the middle 50% of the data

Calculating Interquartile Range

Calculation: IQR = Q3 - Q1
Interpretation: the distance between the first quartile and the third quartile
Advantages: Resistant to outliers
Disadvantages: Ignores values outside the middle 50% of the data, so it says nothing about the tails
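A minimal R sketch of the Q3 - Q1 calculation, using the seven times from the mean example. R's `quantile()` uses its default interpolation rule (type 7) here. Other software may use a different quartile rule, so small differences in Q1 and Q3 are possible.

```r
times <- c(12.5, 10.8, 11.2, 13.1, 12.9, 11.7, 12.3)

# First and third quartiles (R default: type 7 interpolation)
quartiles <- quantile(times, probs = c(0.25, 0.75))
q1 <- quartiles[["25%"]]
q3 <- quartiles[["75%"]]

iqr_manual <- q3 - q1      # Q3 - Q1
iqr_builtin <- IQR(times)  # built-in equivalent

iqr_manual
iqr_builtin
```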

Variance

\(s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}\)

Interpretation: A measure of how much the data deviates from the mean
Advantages: Widely used and well known
Disadvantages: Can be sensitive to outliers
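The formula can be traced term by term in R. The sketch below uses the times from the mean example and compares the hand calculation with the built-in `var()`, which also divides by \(n-1\). Variable names are illustrative.

```r
times <- c(12.5, 10.8, 11.2, 13.1, 12.9, 11.7, 12.3)

n <- length(times)
deviations <- times - mean(times)    # x_i - x-bar
sum_squares <- sum(deviations^2)     # sum of squared deviations
var_manual <- sum_squares / (n - 1)  # divide by n - 1 (sample variance)

round(var_manual, 3)   # about 0.749 (in squared seconds)
var(times)             # built-in sample variance, same value
```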

Standard Deviation

Calculation: \(s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}\)
Interpretation: A measure of the amount of variation or dispersion of a set of values from the mean
Advantages: Widely used and well known
Disadvantages: Can be sensitive to outliers; more difficult to interpret than the range or IQR
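To illustrate the sensitivity to outliers, the sketch below compares the standard deviation and the IQR for the two data sets from the median slide, which differ only in their largest value. `IQR()` uses R's default quartile rule.

```r
data1 <- c(1, 3, 7, 7, 8, 99)  # contains an extreme value
data2 <- c(1, 3, 7, 7, 8, 9)

# Standard deviation is the square root of the sample variance
sd1 <- sd(data1)
sd2 <- sd(data2)
stopifnot(isTRUE(all.equal(sd1, sqrt(var(data1)))))

# The extreme value inflates the standard deviation...
cat("SD with outlier:   ", round(sd1, 2), "\n")
cat("SD without outlier:", round(sd2, 2), "\n")

# ...while the IQR is identical for both data sets
cat("IQR with outlier:   ", IQR(data1), "\n")
cat("IQR without outlier:", IQR(data2), "\n")
```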

Coefficient of Variation

Calculation: \(CV = \frac{s}{\bar{x}} \times 100\%\)
Interpretation: A measure of the relative variation or dispersion of a data set, particularly useful for comparing the variability of data sets with different units or scales
Advantages: Allows for comparison of the relative variability of data sets with different scales or units
Disadvantages: Not suitable for data sets with negative or zero mean values
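A small R sketch of the coefficient of variation. The helper `cv` is hypothetical (base R has no CV function), and `jump_cm` is made-up data added only to show a comparison across different units. The helper stops with an error when the mean is not positive, in line with the limitation noted above.

```r
# Hypothetical helper: coefficient of variation as a percentage
cv <- function(values) {
  stopifnot(mean(values) > 0)         # CV is not meaningful for zero or negative means
  sd(values) / mean(values) * 100
}

times <- c(12.5, 10.8, 11.2, 13.1, 12.9, 11.7, 12.3)  # seconds (from the mean example)
jump_cm <- c(45, 52, 38, 60, 49, 41, 55)              # centimetres (made-up values)

cv_times <- cv(times)
cv_jump <- cv(jump_cm)

round(cv_times, 2)  # about 7.17 percent
round(cv_jump, 2)
```

Because both results are percentages of their own means, they can be compared directly even though the raw data are in seconds and centimetres.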

Comparing Measures of Variability

It is essential to consider the characteristics of the data and the research question when comparing measures of variability.
Range and IQR are useful for non-normally distributed data or when identifying outliers.
Variance and standard deviation are useful for normally distributed data and can provide more information about the spread of the distribution.
Coefficient of variation is suitable for comparing the spread of two sets of data with different units.

Comparison table

Measure of Variability	Calculation	Interpretation	Advantages	Disadvantages
Range	Maximum value - Minimum value	The spread of the data from the smallest to the largest value	Easy to understand	Sensitive to outliers
Interquartile Range	Q3 - Q1	The distance between the first quartile and the third quartile	Resistant to outliers	Ignores values outside the middle 50% of the data
Variance	\(s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}\)	A measure of how much the data deviates from the mean	Widely used and well known	Can be sensitive to outliers
Standard Deviation	\(s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}\)	A measure of the amount of variation or dispersion of a set of values from the mean	Widely used and well known	Can be sensitive to outliers; more difficult to interpret than the range or IQR

Using jamovi

Open jamovi, click Exploration, then Descriptives
Move DVs under Variables and IVs under Split by
Select Variables across rows under Descriptives (horizontal format)
jamovi will create a Descriptives Table - see next slide.
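For readers who work in R rather than the jamovi interface, a rough base-R analogue of "descriptives split by a grouping variable" is sketched below. It is not jamovi's own output, and the `scores` data frame, its column names, and the `describe` helper are all hypothetical.

```r
# Made-up data: one dependent variable (sprint_time) and one grouping variable (group)
scores <- data.frame(
  group = rep(c("control", "training"), each = 4),
  sprint_time = c(12.5, 12.9, 13.1, 12.3, 11.2, 10.8, 11.7, 11.5)
)

# Hypothetical helper returning a few descriptive statistics
describe <- function(v) {
  c(n = length(v), mean = mean(v), median = median(v),
    sd = sd(v), min = min(v), max = max(v))
}

# Split the DV by the IV, describe each group, and put one group per row
by_group <- t(sapply(split(scores$sprint_time, scores$group), describe))

print(round(by_group, 2))
```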

Descriptives Table

Summary

Descriptive statistics help in summarizing and describing a dataset’s features.
Measures of variability are used to understand how spread out the data is.
The range, interquartile range, variance, standard deviation, and coefficient of variation are measures of variability that have their advantages and disadvantages.
The choice of measure depends on the characteristics of the data and the research question.

Practice Exercises

Find several exercises by clicking here
