Learning Health Data Analysis in the Age of AI - Part 1
Using ChatGPT to generate hypotheses and create an analytic dataset
Generative AI is fantastic at doing some tasks that may be challenging for prior iterations of technology. Here, I set out to run a case study of some exploratory analysis and simple hypothesis testing.
Inspiration
This work was inspired by entrancing talks on AI delivered by Saif Ahmed and Imran Qureshi. My mind kept coming back to an important takeaway from that talk - you will probably not lose your job to AI, but will likely lose your job to a person using AI to be more productive.
Sleep Dataset
I was recently able to download my physical activity data collected by my smart watch (Fitbit Versa 3) beginning from April 29, 2022, to September 9, 2024 (about two and a half years). For this project, we will be extracting and analyzing daily sleep data as recorded by the device.
Approach
I plan to walk through a simple data analytics project that will include idea generation, data importation and cleaning, and generation of summary statistics and hypothesis testing. For the purpose of this project, we will exclusively use the free version of ChatGPT given its wide availability and name recognition. All code will be generated using the tidyr dialect of R language and will be shared on GitHub. One superpower of generative AI is I can take my code and translate it into any programming language I want.
Given the nuances and complexity of real-world data analytics, study design and idea generation have always been thought of as a pure human domain. However, I tested idea generation using AI with a simple ChatGPT prompt.
I have a data set of daily sleep duration in minutes for the past year.
Can you suggest 10 exploratory analyses and 10 testable hypotheses?
Exploratory Analyses:
1. Summary Statistics: Calculate the mean, median, standard deviation, and range of daily sleep duration to understand the central tendency and variability.
2. Sleep Distribution: Plot the distribution of daily sleep duration (e.g., histogram or kernel density estimate) to visualize the overall pattern.
3. Sleep Trends Over Time: Create a time series plot of daily sleep duration to examine any trends or seasonality across the year.
4. Monthly Averages: Calculate and plot the average sleep duration per month to observe any seasonal variations.
5. Weekday vs. Weekend Sleep: Compare sleep duration between weekdays and weekends using boxplots or bar plots.
6. Sleep Consistency: Calculate the coefficient of variation or other measures of sleep consistency across different weeks or months.
7. Sleep Patterns by Day of the Week: Plot average sleep duration for each day of the week to identify potential weekday vs. weekend effects.
8. Cumulative Sleep Deficit: Calculate the cumulative sleep deficit (e.g., sleep duration < 7 hours per night) over time to assess sleep deprivation.
9. Sleep Duration and Sleep Quality Correlation: If you have sleep quality scores (e.g., self-reported or from a device), explore the relationship between sleep duration and quality.
10. Outlier Detection: Identify any significant outliers or irregular sleep days (e.g., very long or very short sleep) and analyze their potential causes.
Testable Hypotheses:
1. There is no significant difference in average sleep duration between weekdays and weekends.
2. The average sleep duration does not significantly change across different months of the year.
3. There is no linear trend in sleep duration over the course of the year.
4. Sleep duration is not significantly correlated with a person's mood or productivity (if additional mood/productivity data is available).
5. Individuals get, on average, at least 7 hours of sleep per night.
6. Sleep duration is independent of the day of the week (no significant difference in sleep duration across days).
7. There is no significant difference in sleep variability (standard deviation) between weekdays and weekends.
8. There is no significant difference in sleep duration between summer and winter months.
9. Sleep deficit does not accumulate over time (e.g., no significant trend in cumulative sleep deficit).
10. Sleep duration is not significantly correlated with physical activity levels (if data on physical activity is available).
I am impressed by the list of summary statistics and hypotheses. I will say that ChatGPT got this part about 80% correct, which significantly reduces the amount of brainstorming I have to do in order to come up with the list.
Data Import
Fitbit data was provided in 30 JSON files, with each file containing one month of data. I have not worked with JSON files in the past, but with about 4 prompts and some light editing, mostly involving changing variable names and rearranging lines of code, I was able to successfully import and merge daily sleep data.
Here is a shortlist of prompts I used to get >80% of the code chunk below:
write tidyr code that imports a json file and covert it to tabular format
write tidyr code that creates a list containing the full path to all json files in a folder c/my/folder/path/to/sleep/
A list of json files are obtained and named sleep_files. Create tidy R code to import all json files and convert them to tabular format.
Using the code below, I got an error message also pasted. Debug the code chunk <<paste code snippet and error message>>
given a dataframe with a column sleep_hours. write tidyr code that generates histogram of sleep_hours using ggplot. Add a red dotted line to denote median sleep hours.
The code generated by these prompts is added to the end of this document.
Finally, a histogram showing that my median daily sleep duration is 7 hours, 21 minutes!
A few observations - there are some outlier days with >20 hours of sleep. These needs to be considered more carefully and then excluded. My initial hunch is that these may be days when I traveled internationally across time zones.
Below is my code so far that is >80% generated by ChatGPT.
In subsequent posts, I will go deeper into exploratory analyses and hypothesis testing.
library(tidyr)
library(jsonlite)
library(dplyr)
library(purrr)
library(ggplot2)
library(lubridate)
folder <- "C:/Users/musta/Documents/MyFitbitData/MM/Sleep/"
files <- list.files(folder, full.names = TRUE)
#select all files that start with "sleep" and end with ".json"
sleep_files <- files[grep("^sleep.*\\.json$", basename(files), ignore.case = TRUE)]
week_order <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
month_order <- c("January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November",
"December")
# Function to process a single JSON file
process_json_file <- function(file_path) {
# Read JSON file
json_data <- jsonlite::fromJSON(file_path, flatten = TRUE)
# Extract the filename without extension
file_name <- tools::file_path_sans_ext(basename(file_path))
# Convert to tibble and add filename as a column
df <- as_tibble(json_data) %>%
mutate(source_file = file_name)
# Unnest the data if it contains nested lists or arrays
# df_unnested <- df %>%
# unnest(cols = where(is.list), names_sep = "_", keep_empty = TRUE)
return(df)
}
# Process all sleep files
sleep_list <- list()
for (i in 1:length(sleep_files)) {
print(sleep_files[i])
cat(readLines(sleep_files[i], n = 5))
sleep_list[[i]] <- process_json_file(sleep_files[i])
}
# Combine all data frames into a single tibble
sleep_data <- bind_rows(sleep_list)
# Assuming sleep_data is your dataframe
sleep_daily <- sleep_data %>%
mutate(date = as.Date(dateOfSleep)) %>% # Ensure dateOfSleep is in Date format
select(-logId, -startTime, -endTime, -dateOfSleep, -source_file
) %>% # Remove unnecessary columns
group_by(date) %>%
summarise(across(where(is.numeric), list(
sum = ~sum(.x, na.rm = TRUE)
))) %>%
ungroup() %>%
mutate(week_day = weekdays(date),
week_day = factor(week_day, levels = week_order),
month = month(date, label = TRUE, abbr = FALSE),
month = factor(month, levels = month_order),
) %>% #calculate weekday
select(date, week_day, everything()) #make date the first column
View(sleep_daily)
# Create the histogram and add the median line
ggplot(sleep_daily, aes(x = sleep_hours)) +
geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
geom_vline(aes(xintercept = median_sleep), color = "red",
linetype = "dashed", size = 1) + # Add median line
labs(x = "Hours of Sleep", y = "Frequency",
title = "Histogram of Daily Sleep Hours with Median Line") +
theme_minimal()
Brilliant 🥰