Assignment 3: Using Python for Data Science

Author

Ryan M. Moore, PhD

Published

April 13, 2025

Modified

September 3, 2025

Note: To complete the assignment, you will need the material contained in the data directory. There, you will find the CSV files for the assignment. These files are slightly different from the ones used in Chs. 7 & 8.

Overview

In this assignment, you will work with common Python data science packages to analyze cancer death rates and their relationship to demographic and social factors across different states. You’ll use packages such as pandas, NumPy, SciPy, Seaborn, and statsmodels to manipulate data, perform statistical analysis, and create visualizations.

Learning Objectives

  • Import and use common Python data science libraries
  • Clean and prepare datasets for analysis
  • Merge multiple datasets using pandas
  • Normalize data to account for population differences
  • Perform statistical tests (ANOVA, Tukey’s HSD)
  • Create data visualizations for correlation analysis
  • Conduct Principal Component Analysis (PCA)

Data Files

You will work with several datasets:

  • cancer_deaths.csv: Cancer death rates by state
  • state_demographics.csv: Demographic information by state
  • bls_regions.csv: Bureau of Labor Statistics regions for each state
  • social_determinants.csv: Social determinants of health by state

Instructions

Part 0: Install Needed Packages

Ensure that you have the following packages installed:

  • numpy
  • pandas
  • scipy
  • seaborn
  • statsmodels

Once you have them installed, you can put the following code block near the top of your Quarto document.

import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns

from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.multivariate.pca import PCA

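# Limit printed data frames to 6 rows so output stays compact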
pd.set_option("display.max_rows", 6)

Part 1: Data Preparation

  1. Load the cancer death rates dataset
  2. Load and prepare the state demographics dataset:
  • Extract state populations into a separate table
  • Convert raw demographic counts to rates per 1000 people
  3. Load the BLS regions dataset
  4. Load and prepare the social determinants dataset:
  • Merge with state population data
  • Convert raw counts to rates per 1000 people
  • Pivot the data frame to have variables as columns and states as rows

Notes:

  • Use pandas
  • See Ch. 7 for an example (a rough sketch also follows below)
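
For orientation, here is a minimal sketch of these steps. The column names (State, Population, Variable, Value) and the data/ paths are assumptions; adjust them to match the actual CSV headers:

# Load the cancer death rates by state
cancer_deaths = pd.read_csv("data/cancer_deaths.csv")

# Load demographics and pull the state populations into their own table
demographics = pd.read_csv("data/state_demographics.csv")
populations = demographics[["State", "Population"]]

# Convert raw demographic counts to rates per 1000 people
count_columns = demographics.columns.drop(["State", "Population"])
demographics[count_columns] = (
    demographics[count_columns].div(demographics["Population"], axis=0) * 1000
)

# Load the BLS regions
regions = pd.read_csv("data/bls_regions.csv")

# Load social determinants (assumed to be in long format), merge in the
# populations, normalize to rates per 1000, then pivot to one row per state
social = pd.read_csv("data/social_determinants.csv")
social = social.merge(populations, on="State", how="left")
social["Rate"] = social["Value"] / social["Population"] * 1000
social_wide = social.pivot(index="State", columns="Variable", values="Rate")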

Part 2: Data Exploration

Examine Predictors By Themselves

  1. Merge state demographics, state regions, and social determinants data into a single data frame called predictors. You will need to chain together multiple merge() calls. Use a left join to ensure that all states remain in the resulting data frame.
  2. Create a correlation heatmap (clustermap()) to explore relationships between predictors

Notes:

  • Use seaborn’s clustermap() for the heatmap (a sketch follows below)
  • See Ch. 7 for an example of this
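
A minimal sketch of the chained left joins and the heatmap, assuming the data frames from Part 1 all share a State column:

# Chain left joins so that every state is retained
predictors = (
    demographics
    .merge(regions, on="State", how="left")
    .merge(social_wide.reset_index(), on="State", how="left")
)

# Cluster the pairwise correlations between the numeric predictors
sns.clustermap(predictors.corr(numeric_only=True))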

Examine Cancer Death Rates & Predictors

  1. Join the cancer deaths data frame to the predictors data frame (use a left join again), and call the result cancer
  2. Generate a correlation matrix from the cancer data frame
  • This will give you a square matrix, which you need to filter as follows:
    • Keep Cancer Death Rate.Total, Cancer Death Rate.Breast, Cancer Death Rate.Colorectal, Cancer Death Rate.Lung as columns
    • Remove those same four variables from the rows
    • Run the correlation heatmap on the resulting non-square correlation matrix

Notes:

  • Use seaborn’s clustermap() for the heatmap
  • See Ch. 7 for an example of this (there is an almost identical example of the filtering in that chapter; a sketch also follows below)
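
One way to do the filtering, assuming the four death rate columns are named exactly as listed above:

# The four response variables to keep as heatmap columns
death_rate_columns = [
    "Cancer Death Rate.Total",
    "Cancer Death Rate.Breast",
    "Cancer Death Rate.Colorectal",
    "Cancer Death Rate.Lung",
]

# Square correlation matrix of the merged cancer data frame
correlations = cancer.corr(numeric_only=True)

# Keep the death rates as columns, drop them from the rows,
# then run the clustered heatmap on the non-square result
sns.clustermap(correlations[death_rate_columns].drop(index=death_rate_columns))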

Part 3: Statistical Testing

  1. Analyze cancer death rates by region
  • Perform an ANOVA test to determine if there are significant differences between regions
  • If there is a significant difference, conduct Tukey’s HSD post-hoc test to identify which specific regions differ

Notes:

  • Use SciPy for the ANOVA
  • Use statsmodels for Tukey’s HSD
  • See Ch. 8 for example code (a sketch follows below)
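
A sketch of this workflow, assuming the cancer data frame has a Region column and that Cancer Death Rate.Total is the response of interest:

# Collect the total death rates for each region into a list of arrays
groups = [
    group["Cancer Death Rate.Total"].dropna()
    for _, group in cancer.groupby("Region")
]

# One-way ANOVA across the regions
f_statistic, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_statistic:.2f}, p = {p_value:.4f}")

# Only run the post-hoc test if the ANOVA is significant
# (assumes no missing values in these two columns)
if p_value < 0.05:
    tukey = pairwise_tukeyhsd(
        endog=cancer["Cancer Death Rate.Total"],
        groups=cancer["Region"],
    )
    print(tukey.summary())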

Part 4: Principal Components Analysis (PCA)

  1. Perform Principal Component Analysis (PCA):
  • Prepare the data by dropping non-numeric columns
  • Run PCA with standardization
  • Create a scatter plot of the first two principal components, colored by region

Notes:

  • Use statsmodels for the PCA
  • Use seaborn for the plot
  • See Ch. 8 for example code for both the PCA and the plot function (a sketch follows below)
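
A sketch using statsmodels’ PCA class; the non-numeric column names (State, Region) are assumptions:

# Keep only the numeric columns (assumed non-numeric: State, Region)
numeric = cancer.drop(columns=["State", "Region"]).dropna()

# Run the PCA on standardized data, keeping the first two components
pca = PCA(numeric, ncomp=2, standardize=True)

# Scatter plot of the first two principal components, colored by region
scores = pca.factors  # one row per state, one column per component
sns.scatterplot(
    x=scores.iloc[:, 0],
    y=scores.iloc[:, 1],
    hue=cancer.loc[scores.index, "Region"],
)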

Part 5: Interpretation

This is an assignment, not a miniproject, so you only have to write a single sentence for each of these three items. E.g., “the p-value for the ANOVA was 0.12, so we do not reject the null hypothesis”, or “strong trends were seen in the PCA plot: states grouped by BLS region”, that sort of thing.

  1. Comment on the correlation patterns you observe in the heatmap
  2. Interpret the results of the ANOVA and Tukey’s HSD tests
  3. Analyze the PCA results and discuss any patterns observed

Requirements

  • Create your solution in a Quarto notebook
  • Use the requested libraries
  • Your code should be well-commented and organized
    • Each logical section of code should be in its own code block (like how I wrote previous chapters and assignments)
    • Above each code block, write a sentence or two describing what the code is about to do
  • Include the requested plots
  • Provide concise interpretations of your statistical findings (just one sentence is necessary)

Submission

  • Turn in a Quarto notebook with your name in the title, e.g., assignment_3__ryan_moore.qmd
  • This notebook should contain your code, visualizations, and interpretations.

Hints

  • Pay attention to how data is normalized to account for different state populations (see Ch. 7 for an example)
  • When merging datasets, be careful with the join type (left, right, inner, outer)
  • Pretty much everything you need to do has a very similar example in either Ch. 7 or Ch. 8, but you may need to go back to the documentation for the given packages for more info on the required functions and parameters.

This assignment will help you gain practical experience with data manipulation, statistical analysis, and visualization using Python’s data science ecosystem. It will be a bit of a jump in difficulty, since you will need to put together example code from the chapters and tutorials yourself rather than having it provided in this document. Good luck!