Plotting data and interpreting it¶

Today we will plot data of different measurement types and try to understand when the different kinds of plots are informative and what they tell us about how the data is distributed.

We will examine how we can use descriptive statistics to describe the distribution. In particular, how can we quantify a typical value? And how can we quantify how much the values deviate from these typical values?

First we will import some modules that we will use...

In [1]:
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

Next we should import some data.

[Image: the three cone sizes (Mini cup, Standard cornet, Big cone)]

This example is adapted from Chapter 1 of Pirates, Peaches and p-values

In [2]:
url = 'https://piratepeel.github.io/GlobalStudiesQuantMethodsS1/data/ice_cream_sales.csv'
iceCreamData = pd.read_csv(url)
iceCreamData.head(10)
Out[2]:
     Flavour             Cone  Temperature  Scoops
0      peach       giant cone         -0.4       3
1    vanilla         mini cup         -2.2       1
2    vanilla       giant cone         -2.0       4
3      peach  standard cornet          0.4       2
4      peach  standard cornet         -1.9       2
5      peach         mini cup         -1.4       2
6  chocolate  standard cornet         -2.8       3
7      peach  standard cornet         -2.5       2
8      peach       giant cone         -0.3       8
9    vanilla  standard cornet         -1.4       3

Plotting Nominal data¶

First we will choose the column that is measured as a nominal scale and plot it using a pie chart and a bar chart.

In [10]:
# Count the occurrences of each category in the 'Flavour' column
category_counts_f = iceCreamData['Flavour'].value_counts()

# Create a pie chart
category_counts_f.plot(kind='pie', autopct='%1.1f%%')

# Adding a title
plt.title('Pie Chart of Flavour Distribution')
# Remove the x and y-label
plt.xlabel('')
plt.ylabel('')
# Display the chart
plt.axis('equal')  # Equal aspect ratio ensures that the pie chart is circular.
Out[10]:
(-1.1129928317635196,
 1.1006187320057477,
 -1.1006187372479035,
 1.1129928321258653)
In [11]:
# Count the occurrences of each category in the 'Flavour' column
category_counts_f = iceCreamData['Flavour'].value_counts()

# Create a bar chart
category_counts_f.plot(kind='bar')

# Adding labels and title
plt.xlabel('Flavours')
plt.ylabel('Count')
plt.title('Count of Categories')
Out[11]:
Text(0.5, 1.0, 'Count of Categories')

Plotting Ordinal data¶

In [12]:
# Count the occurrences of each category in the 'Cone' column
category_counts_c = iceCreamData['Cone'].value_counts()

# Create a pie chart
category_counts_c.plot(kind='pie', autopct='%1.1f%%')

# Adding a title
plt.title('Pie Chart of Cone Distribution')
# Remove the x and y-label
plt.xlabel('')
plt.ylabel('')
# Display the chart
plt.axis('equal')  # Equal aspect ratio ensures that the pie chart is circular.
Out[12]:
(-1.1000000009884476,
 1.100000020757401,
 -1.1129928306484105,
 1.1006187062213528)
In [23]:
# Count the occurrences of each category in the 'Cone' column
category_counts_c = iceCreamData['Cone'].value_counts()

# Create a bar chart
category_counts_c.plot(kind='bar')

# Adding labels and title
plt.xlabel('Cone')
plt.ylabel('Count')
plt.title('Count of Categories')
Out[23]:
Text(0.5, 1.0, 'Count of Categories')
In [24]:
category_counts_c
Out[24]:
standard cornet    5
giant cone         3
mini cup           2
Name: Cone, dtype: int64

How to change the order of the bars¶

To change the order of the bars, we need to do two simple things:

  • set the desired order in a list
  • use the reindex() function to apply this order to the values
# Get the counts of each category
values = df['category'].value_counts()

# Define the desired order of categories
desired_order = ['Group1', 'Group2', 'Group3']  # Change this order as needed
values = values.reindex(desired_order) # Reindex the values in the desired order

Then create the plot as usual.

Note that if the index values are numbers (float or int) then we can sort the index automatically:

desired_order = category_counts.index.sort_values()

However, even an ordered numeric index does not work well with a bar chart!
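
For example, here is a minimal sketch (not one of the original cells; the variable name temperature_counts is just for illustration) that sorts the Temperature counts by their numeric index before plotting. The bars come out in order, but the spacing between them ignores the actual gaps between the values:

# count the temperature values and sort the numeric index
temperature_counts = iceCreamData['Temperature'].value_counts()
desired_order = temperature_counts.index.sort_values()
temperature_counts = temperature_counts.reindex(desired_order)

# the bar chart still treats each temperature as its own category,
# so the bars are equally spaced regardless of the gaps between values
temperature_counts.plot(kind='bar')
plt.xlabel('Temperature')
plt.ylabel('Count')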

In [25]:
# Define the desired order of categories
desired_order = ['mini cup', 'standard cornet', 'giant cone']  # Change this order as needed
category_counts_c = category_counts_c.reindex(desired_order) # Reindex the values in the desired order
In [26]:
category_counts_c
Out[26]:
mini cup           2
standard cornet    5
giant cone         3
Name: Cone, dtype: int64
In [27]:
# Create a bar chart
category_counts_c.plot(kind='bar')

# Adding labels and title
plt.xlabel('Cone')
plt.ylabel('Count')
plt.title('Count of Categories')
Out[27]:
Text(0.5, 1.0, 'Count of Categories')

Plotting Interval data¶

In [28]:
# Count the occurrences of each value in the 'Temperature' column
category_counts_t = iceCreamData['Temperature'].value_counts()

# Create a pie chart
category_counts_t.plot(kind='pie', autopct='%1.1f%%')

# Adding a title
plt.title('Pie Chart of Temperature Distribution')
# Remove the x and y-label
plt.xlabel('')
plt.ylabel('')
# Display the chart
plt.axis('equal')  # Equal aspect ratio ensures that the pie chart is circular.
Out[28]:
(-1.1000000056491936,
 1.100000010561534,
 -1.100000007859747,
 1.1000000029474057)
In [29]:
# no need to calculate the counts again: they were
# already computed in the cell above

# Create a bar chart
category_counts_t.plot(kind='bar')

# Adding labels and title
plt.xlabel('Temperature')
plt.ylabel('Count')
plt.title('Count of Temperatures')
Out[29]:
Text(0.5, 1.0, 'Count of Temperatures')

Observation: The pie chart does not tell us much because we lose the order of the temperature values. The bar chart is also not very informative because we lose the interpretation of the intervals: equal spacing between bars does not correspond to equal differences in temperature.

In [36]:
# Create a histogram
iceCreamData['Temperature'].plot(kind='hist', bins=3, edgecolor='black', alpha=0.7)

# Adding labels and title
plt.xlabel('Temperature')
plt.ylabel('Frequency')
plt.title('Temperature Distribution Histogram')
Out[36]:
Text(0.5, 1.0, 'Temperature Distribution Histogram')

How many bins should I use? Perhaps 3? 5? 10? 25?
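
One way to get a feel for this (a quick sketch, not one of the original cells) is to draw the same histogram with several different bin counts side by side:

# compare the same data with different numbers of bins
fig, axes = plt.subplots(1, 4, figsize=(16, 3))
for ax, bins in zip(axes, [3, 5, 10, 25]):
    iceCreamData['Temperature'].plot(kind='hist', bins=bins, edgecolor='black', alpha=0.7, ax=ax)
    ax.set_title(f'{bins} bins')
    ax.set_xlabel('Temperature')
plt.tight_layout()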

What do these plots tell you?¶

In [ ]:
 

Plotting Ratio data¶

In [37]:
# Count the occurrences of each value in the 'Scoops' column
category_counts_s = iceCreamData['Scoops'].value_counts()

# Create a pie chart
category_counts_s.plot(kind='pie', autopct='%1.1f%%')

# Adding a title
plt.title('Pie Chart of Scoops Distribution')
# Remove the x and y-label
plt.xlabel('')
plt.ylabel('')
# Display the chart
plt.axis('equal')  # Equal aspect ratio ensures that the pie chart is circular.
Out[37]:
(-1.1129928313852615,
 1.1006187216952246,
 -1.1008152309433816,
 1.1171195796322744)
In [39]:
# no need to calculate the counts again: they were
# already computed in the cell above

# Create a bar chart
category_counts_s.plot(kind='bar')

# Adding labels and title
plt.xlabel('Scoops')
plt.ylabel('Count')
plt.title('Count of Scoops')
Out[39]:
Text(0.5, 1.0, 'Count of Scoops')
In [41]:
# Create a histogram
iceCreamData['Scoops'].plot(kind='hist', bins=5, edgecolor='black', alpha=0.7)

# Adding labels and title
plt.xlabel('Scoops')
plt.ylabel('Frequency')
plt.title('Scoops Distribution Histogram')
Out[41]:
Text(0.5, 1.0, 'Scoops Distribution Histogram')

Mode¶

The mode is the value that occurs most often.

What is the mode for each of the variables:

  • Flavour:
  • Cone:
  • Temperature:
  • Scoops:
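
To check your answers, pandas has a built-in .mode() method (a minimal sketch, not one of the original cells; .mode() returns a Series because there can be more than one most frequent value):

# print the most frequent value(s) of each column
for column in ['Flavour', 'Cone', 'Temperature', 'Scoops']:
    print(column, iceCreamData[column].mode().values)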

Median¶

The median is the value that lies in the middle if we order the values from lowest to highest: 50% of the values lie below it and 50% lie above it.

What is the median for each of the variables:

  • Flavour:
  • Cone:
  • Temperature:
  • Scoops:
In [59]:
# with 10 values the median is the average of the two middle values
# (the 5th and 6th values after sorting, i.e. indices 4 and 5)
middle_values = np.sort(iceCreamData['Temperature'].values)[4:6]
np.mean(middle_values)
Out[59]:
-1.65
In [60]:
np.median(iceCreamData['Temperature'])
Out[60]:
-1.65
In [61]:
# the median is also the 50th percentile
np.percentile(iceCreamData['Temperature'], 50)
Out[61]:
-1.65
In [ ]:
 

Mean¶

We calculate the mean by adding up all the values and dividing by the total number of values.
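
In symbols, if $x_1, x_2, \ldots, x_n$ are the $n$ values, then the mean is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$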

What is the mean for each of the variables:

  • Flavour:
  • Cone:
  • Temperature:
  • Scoops:
In [62]:
# the sum of a nominal variable makes no sense:
# np.sum just concatenates the strings
np.sum(iceCreamData['Flavour'])
Out[62]:
'peachvanillavanillapeachpeachpeachchocolatepeachpeachvanilla'
In [64]:
# the mean temperature: the sum of the values divided by how many there are (10)
np.sum(iceCreamData['Temperature'])/10
Out[64]:
-1.4500000000000002
In [71]:
meanScoops = np.sum(iceCreamData['Scoops'])/10
print(f'The mean number of scoops is {meanScoops}')
The mean number of scoops is 3.0

Measuring how much the values vary¶

With quantitative data, a natural question is how much do the values in the data vary? How large is the dispersion or spread of the data?

For instance, do all customers buy more or less the same amounts of ice cream or does the amount of ice cream vary a lot from customer to customer?

First let's look at how much the values deviate from the mean value...

In [80]:
# if all the values are the same then the difference 
# from the mean is zero for all purchases
np.ones(10)*3 - meanScoops
Out[80]:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
In [83]:
# when people buy different amounts of ice cream
# we see different distances from the mean
differenceFromTheMean = iceCreamData['Scoops'].values - meanScoops
differenceFromTheMean
Out[83]:
array([ 0., -2.,  1., -1., -1., -1.,  0., -1.,  5.,  0.])
In [86]:
# when we sum these differences, however, we always get zero
# this is because all the differences cancel each other out
np.sum(differenceFromTheMean)
Out[86]:
0.0
In [94]:
# squaring the differences makes them all positive
differenceFromTheMean * differenceFromTheMean
Out[94]:
array([ 0.,  4.,  1.,  1.,  1.,  1.,  0.,  1., 25.,  0.])
In [87]:
# np.square gives the same result
squaredDifference = np.square(differenceFromTheMean)
squaredDifference
Out[87]:
array([ 0.,  4.,  1.,  1.,  1.,  1.,  0.,  1., 25.,  0.])
In [90]:
# the mean of the squared differences from the mean
# is called the variance
variance = np.sum(squaredDifference)/10
variance
Out[90]:
3.4
In [91]:
# the standard deviation is the square root of the variance
np.sqrt(variance)
Out[91]:
1.8439088914585775
In [92]:
# the same calculation in a single expression: the standard deviation of Scoops
np.sqrt(np.sum(np.square(iceCreamData['Scoops'].values - np.mean(iceCreamData['Scoops'])))/10)
Out[92]:
1.8439088914585775
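
As a cross-check (not one of the original cells), NumPy's built-in functions give the same values when they divide by n (their default, ddof=0):

# population variance and standard deviation (dividing by n),
# matching the manual calculation above
print(np.var(iceCreamData['Scoops'].values))
print(np.std(iceCreamData['Scoops'].values))

# note: pandas' own .var() and .std() methods divide by (n - 1) by default,
# so iceCreamData['Scoops'].std() gives a slightly larger value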

Sensitivity to outliers¶

The customer who bought 8 scoops is quite unusual. We might consider this value to be an outlier. Below we load a second version of the dataset in which this purchase has been replaced with a more typical value (4 scoops).

In [95]:
url = 'https://piratepeel.github.io/GlobalStudiesQuantMethodsS1/data/ice_cream_sales_no_outlier.csv'
iceCreamData_noOutlier = pd.read_csv(url)
iceCreamData_noOutlier.head(10)
Out[95]:
     Flavour             Cone  Temperature  Scoops
0      peach       giant cone         -0.4       3
1    vanilla         mini cup         -2.2       1
2    vanilla       giant cone         -2.0       4
3      peach  standard cornet          0.4       2
4      peach  standard cornet         -1.9       2
5      peach         mini cup         -1.4       2
6  chocolate  standard cornet         -2.8       3
7      peach  standard cornet         -2.5       2
8      peach       giant cone         -0.3       4
9    vanilla  standard cornet         -1.4       3
In [101]:
np.mean(iceCreamData['Scoops'])
Out[101]:
3.0
In [102]:
np.mean(iceCreamData_noOutlier['Scoops'])
Out[102]:
2.6
In [103]:
np.median(iceCreamData['Scoops'])
Out[103]:
2.5
In [104]:
np.median(iceCreamData_noOutlier['Scoops'])
Out[104]:
2.5
In [105]:
np.sort(iceCreamData['Scoops'])
Out[105]:
array([1, 2, 2, 2, 2, 3, 3, 3, 4, 8])
In [106]:
np.sort(iceCreamData_noOutlier['Scoops'])
Out[106]:
array([1, 2, 2, 2, 2, 3, 3, 3, 4, 4])
In [110]:
scoops75 = np.percentile(iceCreamData['Scoops'], 75)
In [111]:
scoops25 = np.percentile(iceCreamData['Scoops'], 25)
In [112]:
# the difference between the 75th and 25th percentiles
# is called the interquartile range (IQR)
scoops75 - scoops25
Out[112]:
1.0
In [ ]:
 
In [113]:
scoops75 = np.percentile(iceCreamData_noOutlier['Scoops'], 75)
In [114]:
scoops25 = np.percentile(iceCreamData_noOutlier['Scoops'], 25)
In [115]:
# the IQR of the data without the outlier
scoops75 - scoops25
Out[115]:
1.0
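
To round this off, here is a short sketch (not one of the original cells) comparing the standard deviation and the interquartile range with and without the outlier:

# compare two measures of spread on both versions of the data
for name, scoops in [('with outlier', iceCreamData['Scoops']),
                     ('without outlier', iceCreamData_noOutlier['Scoops'])]:
    std = np.std(scoops.values)  # population standard deviation (divide by n)
    iqr = np.percentile(scoops, 75) - np.percentile(scoops, 25)
    print(f'{name}: standard deviation = {std:.2f}, IQR = {iqr}')

The IQR stays at 1.0 in both cases, while the standard deviation drops noticeably once the unusual purchase is removed: the IQR is much less sensitive to outliers.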