Learning Health Data Analysis in the Age of AI - Part 2
Exploratory Data Analyses using Github Copilot
This is a continuation of my health data analysis series using generative AI.
Part 1 - Learning Health Data Analysis in the Age of AI - hypothesis generation and data import.
If you can’t wait to scroll to the end, here is how much sleep I get during workdays vs weekends.
Remember we extracted daily sleep episodes in minutes and summed up daily sleep episode to generate sleep_hours - a measure of daily total sleep (in hours). We then asked ChatGPT for a range of exploratory analyses and hypotheses we can test on this dataset.
Prompt: What is exploratory data analysis?
Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
Steps in EDA
1. Data Collection
2. Data Cleaning
3. Data Preprocessing
4. Data Exploration
5. Data Visualization
6. Data Modeling
Approach: In this exercise, we use take the first five exploratory analyses suggested by ChatGPT and ask ChatGPT to provide us with the code needed to run the analyses. In this instance, I was pressed for time (had about 1.5 hours to spare and I was at a cafe with no additional computer monitor), so I used Github Copilot ($10/month from Microsoft) connected to RStudio. Copilot work similarly to ChatGPT but has 2 advantages.
No need for switching windows, your code is pasted directly into RStudio without having to use/copy paste
Less code tweaking is needed from my side since Copilot has access to all my previous code and better understands my variable names. ChatGPT sometimes provides code that you need to adapt to your coding environment.
First, I removed days with <2 hours or >16 hours of sleep as outliers.
Prompt: remove outliers for sleep_hours >16 or sleep_hours < 2
sleep_daily_clean <- sleep_daily[sleep_daily$sleep_hours < 16 & sleep_daily$sleep_hours > 2,]
Prompt: Calculate the mean, median, standard deviation, and range of daily sleep duration to understand the central tendency and variability.
cat("Mean sleep hours (with outliers): ", round(mean(sleep_daily$sleep_hours), 2), "\n")
...
Prompt: Plot the distribution of daily sleep duration using histogram and kernel density estimate to visualize the overall pattern of sleep_daily_clean$sleep_hours with median line
Prompt: Create a time series plot of daily sleep duration to examine any trends or seasonality across the year.
This plot is generated from the code below, all provided by Copilot. I only changed poly(x, 2) to poly(x, 5) to adjust the curvature of the polynomial.
ggplot(sleep_daily_clean, aes(x = date, y = sleep_hours)) +
geom_line(color = "skyblue") +
geom_smooth(method = "lm", formula = y ~ poly(x, 5), color = "red") +
labs(x = "Date", y = "Hours of Sleep",
title = "Daily Sleep Duration Over Time") +
theme_minimal()I wanted to focus a bit more on 2024, so I asked Copilot to do so:
Seems like my sleep duration is on an increasing trend during late August through early September, 2024.
Prompt: Compare sleep duration between weekdays and weekends using boxplots or bar plots.
Looks like I probably sleep longer during weekends, though some work days I may sleep a LOT (probably during work holidays or travel days).
That was fun! Took me less time to do these analyses than it took to write up this post. So, Copilot definitely
Next, we will go into some hypothesis testing in Part 3








