Data Science for COVID-19 (DS4C) dataset
This is a sample notebook on using the EHRtemporalVariability R package for temporal exploratory data analysis of the Data Science for COVID-19 (DS4C) Kaggle dataset. We focus on the epidemiological data of COVID-19 patients in South Korea. Because the dataset was created recently and covers a short date range, we analyze it at weekly granularity. The confirmation date is used as the reference date for the analysis.
Besides supporting temporal exploratory data analysis, changes over time in the frequencies of variables can help delineate dataset shifts, which must be taken into account in any further machine learning or statistical modelling on the data (e.g., distinct optimal model configurations might apply at distinct times). Although the date range of this dataset is currently limited, this may become important as it grows, as it already can be in other SARS-CoV-2 and COVID-19 datasets spanning more time (e.g., CORD-19).
Follow the steps below to preprocess the data and estimate Data Temporal Heatmaps and Information Geometric Temporal (IGT) plots. Sample code is then provided for displaying customized EHRtemporalVariability plots and for exporting the results to the interactive Shiny app.
If you use this tool, please cite any of the related publications [1–4].
Install and load the EHRtemporalVariability package
install.packages("EHRtemporalVariability")
library(EHRtemporalVariability)
Download the DS4C patient info CSV file from the Kaggle dataset and load it in R.
data = read.csv2('PatientInfo.csv', sep = ",", header = TRUE, na.strings = "", stringsAsFactors = FALSE, dec = '.',
colClasses = c( "character", #patient_id
"integer", #global_num
"factor", #sex
"numeric", #birth_year
"factor", #age
"factor", #country
"factor", #province
"factor", #city
"factor", #disease
"character", #infection_case
"character", #infected_by
"character", #contact_number
"character", #symptom_onset_date
"Date", #confirmed_date [MAIN DATE FOR ANALYSIS]
"Date", #released_date
"Date", #deceased_date
"factor" #state
))
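Because an unnamed `colClasses` vector is matched to columns by position, it only works if the downloaded file has exactly the layout assumed above. A minimal sketch of a check follows, using a small made-up CSV rather than the real file; the column subset, the values, and the names `csv_text`, `expected` and `toy` are illustrative assumptions. Naming the `colClasses` entries makes the mapping explicit, and comparing the header against the expected names catches layout changes early.

```r
# Toy CSV standing in for PatientInfo.csv (values are made up for illustration)
csv_text <- "patient_id,sex,age,confirmed_date
1000000001,male,50s,2020-01-23
1000000002,female,20s,2020-01-30
1000000003,male,,2020-02-05"

# Named colClasses: each class is tied to a column name, not a position
expected <- c(patient_id = "character", sex = "factor",
              age = "factor", confirmed_date = "Date")

toy <- read.csv(text = csv_text, na.strings = "", colClasses = expected)

# Stop early if the header does not match the expected layout
stopifnot(identical(names(toy), names(expected)))

# Inspect the class each column ended up with
col_classes <- sapply(toy, class)
print(col_classes)
```

With the real file, the same comparison can be run against the full list of expected column names before any further processing.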
Apply some formatting.
# Manually fix variable types
data$contact_number = as.integer(data$contact_number)
data$symptom_onset_date = as.Date(data$symptom_onset_date)
# remove rows with missing dates and remove patient_id column
data = data[!is.na(data$confirmed_date),-1]
# order by variable names (optional)
data = data[,order(names(data))]
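Before estimating anything, it can help to check how many confirmed cases fall in each week, since weeks with very few cases produce noisy distributions. The sketch below uses base R only and a handful of made-up dates; with the real data, `data$confirmed_date` would take the place of the hypothetical `confirmed` vector, and the threshold of 3 is an arbitrary choice for illustration.

```r
# Toy confirmation dates (made up); with the real data, use data$confirmed_date
confirmed <- as.Date(c("2020-01-20", "2020-01-23", "2020-01-30",
                       "2020-02-18", "2020-02-19", "2020-02-20", "2020-02-21"))

# Bin dates into calendar weeks (cut.Date starts weeks on Monday by default)
week_start <- cut(confirmed, breaks = "week")

# Count cases per week; weeks without cases are kept with a count of zero
cases_per_week <- table(week_start)
print(cases_per_week)

# Weeks below an arbitrary threshold are candidates to skip when plotting
sparse_weeks <- names(cases_per_week)[cases_per_week < 3]
print(sparse_weeks)
```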
Estimate DataTemporalMaps and IGTProjections. We disable numeric smoothing because some of the initial weeks do not contain enough distinct individuals to apply kernel density estimation.
probMaps <- estimateDataTemporalMap( data = data,
dateColumnName = "confirmed_date",
period = "week",
numericSmoothing = FALSE)
igtProjs <- sapply ( probMaps, estimateIGTProjection, dimensions = 3)
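To give some intuition for what an IGT projection represents, the sketch below computes pairwise Jensen-Shannon distances between a few weekly probability distributions and embeds the weeks as points with classical multidimensional scaling, following the general idea described in [1]. It is an illustration on made-up frequencies, not the package's implementation; the helper `js_distance` and the `weekly` matrix are assumptions introduced here.

```r
# Hypothetical helper: Jensen-Shannon distance with base-2 logs (range [0, 1])
js_distance <- function(p, q) {
  m <- (p + q) / 2
  # Kullback-Leibler divergence, treating 0 * log(0) as 0
  kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))
  # max() guards against tiny negative values caused by rounding
  sqrt(max(0, (kl(p, m) + kl(q, m)) / 2))
}

# Made-up weekly relative frequencies of a 4-level variable (rows sum to 1)
weekly <- rbind(
  w1 = c(0.10, 0.30, 0.40, 0.20),
  w2 = c(0.12, 0.28, 0.40, 0.20),
  w3 = c(0.60, 0.20, 0.10, 0.10),  # a week with a clearly shifted distribution
  w4 = c(0.11, 0.29, 0.41, 0.19)
)

# Pairwise distance matrix between weeks
n <- nrow(weekly)
d <- matrix(0, n, n, dimnames = list(rownames(weekly), rownames(weekly)))
for (i in seq_len(n)) for (j in seq_len(n)) {
  d[i, j] <- js_distance(weekly[i, ], weekly[j, ])
}
print(round(d, 3))

# Classical multidimensional scaling: each week becomes a point in 2D
coords <- cmdscale(as.dist(d), k = 2)
print(coords)
```

In this toy example the shifted week `w3` lies far from the other three weeks, which is the kind of pattern that appears as an outlier point in an IGT plot.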
Display Data Temporal Heatmaps (DTHs) and IGT plots for the age and province variables. Note that DTHs for categorical variables are sorted by frequency by default. The start date is set slightly later than the first date in the dataset to skip the sparsely populated initial weeks.
plotDataTemporalMap(probMaps$age, absolute = FALSE)
plotDataTemporalMap(probMaps$age, startDate = "2020-01-27", absolute = TRUE)
The DTH for age shows some weeks with high relative frequencies at specific ages. Switching to absolute frequencies, we see that these weeks contain few cases, which leads to noisy distributions. We also observe a large number of infected people aged 20-29 during the two weeks after Feb 23.
plotIGTProjection(igtProjs$age, dimensions = 2, colorPalette = "Magma")
The IGT plot for age highlights the aforementioned weeks with noisy frequencies as outlier points.
plotDataTemporalMap(probMaps$province, startDate = "2020-01-27", absolute = TRUE)
The DTH for province also shows peaks of infections in Gyeongsangbuk-do during the two weeks after Feb 23.
We optionally save the results in an .RData file to be used in the Shiny app, either locally or through the online app.
save(probMaps, igtProjs, file="variability_ds4c_weekly.RData")
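As a quick sanity check, the saved file can be loaded back into a fresh environment to confirm which objects it contains. The sketch below uses placeholder lists in place of the real estimation results and writes to a temporary directory; `rdata_path`, `env` and `loaded` are names introduced here for illustration.

```r
# Placeholder objects; the real ones come from estimateDataTemporalMap()
# and estimateIGTProjection() in the steps above
probMaps <- list(age = "placeholder")
igtProjs <- list(age = "placeholder")

# Save to a temporary location so the example leaves no files behind
rdata_path <- file.path(tempdir(), "variability_ds4c_weekly.RData")
save(probMaps, igtProjs, file = rdata_path)

# Load into a fresh environment to avoid overwriting workspace objects;
# load() returns the names of the restored objects
env <- new.env()
loaded <- load(rdata_path, envir = env)
print(loaded)
```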
1. Sáez C, Rodrigues PP, Gama J, Robles M, García-Gómez JM. Probabilistic change detection and visualization methods for the assessment of temporal stability in biomedical data quality. Data Mining and Knowledge Discovery. 2015;29:950–75.
2. Sáez C, Zurriaga O, Pérez-Panadés J, Melchor I, Robles M, García-Gómez JM. Applying probabilistic temporal and multisite data quality control methods to a public health mortality registry in Spain: A systematic approach to quality control of repositories. Journal of the American Medical Informatics Association. 2016;23:1085–95.
3. Sáez C, García-Gómez JM. Kinematics of big biomedical data to characterize temporal variability and seasonality of data repositories: Functional data analysis of data temporal evolution over non-parametric statistical manifolds. International Journal of Medical Informatics. 2018;119:109–24.
4. Sáez C, Gutiérrez Sacristán A, Kohane I, García-Gómez JM, Avillach P. EHRtemporalVariability: Delineating temporal dataset shifts in electronic health records. Submitted.