Understanding Your Data: A Friendly Guide to Descriptive Statistics

July 12, 2025 | ai | statistics | data-analysis |

Ever wondered how to make sense of a jumble of numbers? That’s where descriptive statistics come in! Think of descriptive statistics as your data’s personal storyteller. It helps you summarize, organize, and simplify complex datasets without delving into complex conclusions or predictions. It’s the first crucial step in any data analysis journey, giving you a clear picture of what you’re working with.

The ABCs of Data: Data Types

Before we start crunching numbers, it’s essential to understand the different flavors of data we might encounter. Just like a chef needs to know if they’re working with fruits or vegetables, a data analyst needs to distinguish between data types.

Quantitative vs. Categorical Data

Imagine you’re collecting information about students in a classroom.

  • Quantitative data is all about numbers – things you can count or measure. It’s like asking “How tall are you?” or “How many books do you have?”. These are numerical values that have a meaningful order and can be used in mathematical calculations.
  • Categorical data, on the other hand, describes qualities or characteristics. It’s like asking “What’s your favorite color?”. These are categories or labels, and while you might assign numbers to them (like 1 for red, 2 for blue), the numbers themselves don’t have mathematical meaning. You can’t add “red” and “blue” together!

Let’s use our classroom example:

Quantitative Data Categorical Data
Student’s height Favorite color
Number of books Gender
Test scores Hair color

Continuous vs. Discrete Data

Now, let’s dive deeper into quantitative data. It can be further broken down into two types:

  • Continuous data can take any value within a given range. Think of it like a dimmer switch for a light – you can have infinitely many light levels between fully off and fully on. Examples include height (you could be 1.75 meters, 1.753 meters, etc.) or temperature (25 degrees Celsius, 25.1 degrees, 25.123 degrees).
  • Discrete data can only take specific, distinct values. It’s like a light switch that’s either on or off – no in-between. Examples include the number of students in a class (you can’t have 25.5 students) or the number of cars in a parking lot.

Using our classroom example again, let’s consider the test scores:

  • If test scores are percentages (e.g., 85.5%, 92.75%), they could be considered continuous if the scoring allows for infinite decimal places.
  • If test scores are whole numbers (e.g., 85, 92, 93), they would be discrete.

Where’s the Middle Ground? Measures of Center

Once you have your data, you’ll often want to know what the “typical” value is. This is where measures of center come in handy. They tell you where the bulk of your data lies.

Let’s imagine our classroom of 10 students took a 10-point quiz. Their scores are: 7, 8, 8, 9, 7, 6, 10, 8, 9, 7.

Mean (Average)

The mean is what most people think of when they hear “average.” You sum up all the values and divide by the number of values. It’s like distributing all the quiz points equally among the students.

\[\text{Mean} = \frac{\text{Sum of all values}}{\text{Number of values}}\]

For our quiz scores:

\[\text{Mean} = \frac{7+8+8+9+7+6+10+8+9+7}{10} = \frac{79}{10} = 7.9\]

So, the average quiz score is 7.9.

Median

The median is the middle value in a dataset when it’s ordered from least to greatest. If you have an even number of values, the median is the average of the two middle values. Think of it as finding the student right in the middle if they lined up by their scores.

First, let’s order our quiz scores: 6, 7, 7, 7, 8, 8, 8, 9, 9, 10.

Since we have 10 scores (an even number), the median is the average of the 5th and 6th scores: (8 + 8) / 2 = 8.

The median quiz score is 8.

Mode

The mode is the value that appears most frequently in your dataset. It’s like finding the most common quiz score.

In our ordered quiz scores (6, 7, 7, 7, 8, 8, 8, 9, 9, 10), both 7 and 8 appear three times. When there are two modes, the dataset is called bimodal.


How Spread Out Is It? Measures of Spread

Knowing the center of your data is great, but it doesn’t tell the whole story. Imagine two classes with an average quiz score of 8. In one class, everyone scored exactly 8. In another, scores ranged from 2 to 10. Both have the same mean, but very different levels of consistency. This is where measures of spread come in, showing you how dispersed or varied your data is.

Let’s continue with our quiz scores: 6, 7, 7, 7, 8, 8, 8, 9, 9, 10.

Range

The simplest measure of spread is the range, which is the difference between the highest and lowest values in your dataset.

\[\text{Range} = \text{Maximum Value} - \text{Minimum Value}\]

For our quiz scores: Range = 10 - 6 = 4.

Five-Number Summary

The five-number summary gives you a quick snapshot of the data’s distribution. It consists of:

  1. Minimum: The smallest value in the dataset (6).
  2. First Quartile (Q1): The median of the lower half of the data. For our scores (6, 7, 7, 7, 8, 8, 8, 9, 9, 10), the lower half is (6, 7, 7, 7, 8). The median of this is 7.
  3. Median (Q2): The middle value of the entire dataset (which we found to be 8).
  4. Third Quartile (Q3): The median of the upper half of the data. The upper half is (8, 8, 9, 9, 10). The median of this is 9.
  5. Maximum: The largest value in the dataset (10).

So, our five-number summary is: (6, 7, 8, 9, 10). This can be visually represented by a box plot.

Standard Deviation and Variance

The standard deviation and variance are perhaps the most common and powerful measures of spread. They tell you, on average, how far each data point is from the mean.

Imagine you have a target, and your mean is the bullseye. Standard deviation tells you how scattered your arrows are around that bullseye. A small standard deviation means the arrows are clustered tightly around the bullseye (data points are close to the mean), while a large standard deviation means they’re all over the place (data points are widely dispersed).

The variance is simply the standard deviation squared. It’s often used in statistical calculations, but the standard deviation is more intuitive for interpretation because it’s in the same units as your original data.

Let’s calculate the standard deviation for our quiz scores (mean = 7.9):

Step 1: Calculate the difference from the mean for each data point:

  • 6 - 7.9 = -1.9
  • 7 - 7.9 = -0.9
  • 7 - 7.9 = -0.9
  • 7 - 7.9 = -0.9
  • 8 - 7.9 = 0.1
  • 8 - 7.9 = 0.1
  • 8 - 7.9 = 0.1
  • 9 - 7.9 = 1.1
  • 9 - 7.9 = 1.1
  • 10 - 7.9 = 2.1

Step 2: Square each of these differences:

\[\begin{aligned} (-1.9)^2 &= 3.61 \\ (-0.9)^2 &= 0.81 \\ (-0.9)^2 &= 0.81 \\ (-0.9)^2 &= 0.81 \\ (0.1)^2 &= 0.01 \\ (0.1)^2 &= 0.01 \\ (0.1)^2 &= 0.01 \\ (1.1)^2 &= 1.21 \\ (1.1)^2 &= 1.21 \\ (2.1)^2 &= 4.41 \end{aligned}\]

Step 3: Sum the squared differences:

\[3.61 + 0.81 + 0.81 + 0.81 + 0.01 + 0.01 + 0.01 + 1.21 + 1.21 + 4.41 = 12.9\]

Step 4: Divide by (n-1) for sample standard deviation.

(We use n-1 to get an unbiased estimate of the population standard deviation, which is more common in practice.)

\[\text{Variance} = \frac{12.9}{10-1} = \frac{12.9}{9} \approx 1.433\]

Step 5: Take the square root of the variance to get the standard deviation:

\[\text{Standard Deviation} = \sqrt{1.433} \approx 1.197\]

So, on average, a student’s quiz score deviates by about 1.2 points from the mean score of 7.9.

Let’s see how we can calculate this quickly in Python:

import numpy as np

quiz_scores = [7, 8, 8, 9, 7, 6, 10, 8, 9, 7]

# Mean
mean_score = np.mean(quiz_scores)
print(f"Mean score: {mean_score:.2f}")

# Median
median_score = np.median(quiz_scores)
print(f"Median score: {median_score:.2f}")

# Mode
from collections import Counter
data = Counter(quiz_scores)
mode_score = [k for k, v in data.items() if v == max(data.values())]
print(f"Mode score(s): {mode_score}")

# Range
data_range = np.max(quiz_scores) - np.min(quiz_scores)
print(f"Range of scores: {data_range}")

# Five-number summary
q1 = np.percentile(quiz_scores, 25)
q3 = np.percentile(quiz_scores, 75)
print(f"Five-number summary (Min, Q1, Median, Q3, Max): "
      f"({np.min(quiz_scores)}, {q1}, {median_score}, {q3}, {np.max(quiz_scores)})")

# Variance (sample variance)
variance_score = np.var(quiz_scores, ddof=1) # ddof=1 for sample variance
print(f"Variance of scores: {variance_score:.2f}")

# Standard Deviation (sample standard deviation)
std_dev_score = np.std(quiz_scores, ddof=1) # ddof=1 for sample standard deviation
print(f"Standard deviation of scores: {std_dev_score:.2f}")

This journey through descriptive statistics helps you transform raw data into meaningful insights. By understanding data types, measures of center, and measures of spread, you’re well-equipped to tell the story your data has to tell.

Edit this post on GitHub

Leia também