Introduction

This is a sample notebook on using the EHRtemporalVariability R package for temporal exploratory data analysis of the Kaggle COVID-19 Open Research Dataset Challenge (CORD-19). CORD-19 is a resource of over 44,000 scholarly articles, including over 29,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.

Besides temporal exploratory data analysis, changes over time in the frequencies of variables or, in this case, in the frequencies of text words and word combinations, can help delineate dataset shifts, which must be taken into account in further machine learning or statistical modelling on the data (e.g., distinct word embedding configurations might occur at distinct times, rather than estimating the embeddings from the complete dataset). Note that in this notebook we are neither analyzing the word embedding representations nor addressing the specific tasks of the Kaggle challenge. It is intended as a guide for exploring temporal changes in the CORD-19 data that might have further implications for data science analyses.

Follow the next steps to preprocess the data and estimate Data Temporal Heatmaps and Information Geometric Temporal plots. Then, sample code is provided for displaying customized EHRtemporalVariability plots, as well as for exporting the results to the interactive Shiny app.

If you use this tool, please cite any of the related publications [1–4].

First, install and load the EHRtemporalVariability package and additional packages.

install.packages("text2vec")
install.packages("corpus")
install.packages("data.table")
install.packages("xts")
install.packages("zoo")
install.packages("EHRtemporalVariability")
library(text2vec)
library(corpus)
library(data.table)
library(xts)
library(zoo)
library(EHRtemporalVariability)

CORD-19 data loading

Download the CORD-19 metadata CSV file from the Kaggle dataset and load it in R.

data = read.csv('metadata.csv', header = TRUE, na.strings = "", stringsAsFactors = FALSE, 
                 colClasses = c( "character", #sha
                                 "factor",    #source_x
                                 "character", #title
                                 "character", #doi
                                 "character", #pmcid
                                 "character", #pubmed_id
                                 "factor",    #license
                                 "character", #abstract
                                 "Date",      #publish_time [MAIN DATE FOR ANALYSIS]
                                 "character", #authors
                                 "factor",    #journal
                                 "character", #Microsoft.Academic.Paper.ID
                                 "character", #WHO..Covidence
                                 "factor",    #has_full_text
                                 "factor"     #full_text_file
                                 ))

# remove rows with missing dates
data = data[!is.na(data$publish_time),]
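
After filtering, it can be useful to quickly check how many articles remain and the date range they cover (an optional sanity check added here for illustration):

# optional check: remaining number of articles and covered publication dates
nrow(data)
range(data$publish_time)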

Text pre-processing for 1-grams in titles

# lowercase preprocessing and word tokenization functions
prep_fun = tolower
tok_fun = word_tokenizer

it_text_title = itoken(data$title, 
                  preprocessor = prep_fun, 
                  tokenizer = tok_fun,
                  progressbar = TRUE)

stop_words = c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now")
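
As an aside, since the corpus package is already loaded, its bundled English stop-word list could be used instead of the manual vector above (an equivalent option; the notebook itself runs with the manual vector):

# alternative: the corpus package ships an English stop-word list
# stop_words = corpus::stopwords_en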

# build the 1-gram vocabulary from titles, excluding stop words
vocabN1_title = create_vocabulary(it_text_title, stopwords = stop_words)

# prune very rare and overly frequent terms
vocabN1_title = prune_vocabulary(vocabN1_title, term_count_min = 10, doc_proportion_max = 0.99,
                           doc_proportion_min = 0.001)

vectorizerN1_title = vocab_vectorizer(vocabN1_title)

# document-term matrix: one row per article, one column per 1-gram
dtm_text_N1_title = create_dtm(it_text_title, vectorizerN1_title)

Text pre-processing for 3-grams in abstracts

it_text_abstract = itoken(data$abstract, 
                  preprocessor = prep_fun, 
                  tokenizer = tok_fun,
                  progressbar = TRUE)

# 3-gram vocabulary from abstracts, using the same stop words
vocabN3_abstract = create_vocabulary(it_text_abstract, ngram = c(3, 3), stopwords = stop_words)

vocabN3_abstract = prune_vocabulary(vocabN3_abstract, term_count_min = 10, doc_proportion_max = 0.99,
                           doc_proportion_min = 0.001)

vectorizerN3_abstract = vocab_vectorizer(vocabN3_abstract)

dtm_text_N3_abstract = create_dtm(it_text_abstract, vectorizerN3_abstract)

Manual generation of Data Temporal Maps for texts

The current version of the EHRtemporalVariability package does not directly support free-text variables as input. However, the package allows the manual generation of Data Temporal Maps (DTMs) through the DataTemporalMap class constructor (see the ?DataTemporalMap help), as we do next to input the frequencies of text n-grams over time.

First we create the DTM for the title 1-gram.

# build an xts time series of 1-gram counts indexed by publication date
dataxts = xts(as.matrix(dtm_text_N1_title), order.by = data$publish_time)
# aggregate the counts by month
sumMonths = apply.monthly(dataxts, FUN = colSums)
countsMonths = coredata(sumMonths)
# convert the monthly counts into relative frequencies
probMonths = sweep(countsMonths, 1, rowSums(countsMonths), "/")
# manual construction of the DataTemporalMap object
probMapTitleN1 <- new('DataTemporalMap', probabilityMap = probMonths, 
                   countsMap = countsMonths, dates = as.Date(index(sumMonths)), support = data.frame(names(sumMonths), stringsAsFactors = FALSE), 
                   variableName = "Title 1-gram", variableType = "factor", period = "month")
igtTitleN1 <- estimateIGTProjection(probMapTitleN1)

Next we create the DTM for the abstract 3-gram.

# same procedure as above, now for the abstract 3-gram counts
dataxts = xts(as.matrix(dtm_text_N3_abstract), order.by = data$publish_time)
sumMonths = apply.monthly(dataxts, FUN = colSums)
countsMonths = coredata(sumMonths)
probMonths = sweep(countsMonths,1,rowSums(countsMonths),"/")
probMapAbstractN3 <- new('DataTemporalMap', probabilityMap = probMonths, 
                      countsMap = countsMonths, dates = as.Date(index(sumMonths)), support = data.frame(names(sumMonths),stringsAsFactors = FALSE), 
                      variableName = "Abstract 3-gram", variableType = "factor", period = "month")
igtAbstractN3 <- estimateIGTProjection(probMapAbstractN3)
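
Before adding these manually built maps to the results, we can optionally verify that each monthly probability distribution sums to one (a sanity check of our own, accessing the probabilityMap slot used in the constructor above):

# optional check: monthly probability rows should each sum to 1
# (months without documents yield NaN rows and are ignored here)
all(abs(rowSums(probMapTitleN1@probabilityMap) - 1) < 1e-9, na.rm = TRUE)
all(abs(rowSums(probMapAbstractN3@probabilityMap) - 1) < 1e-9, na.rm = TRUE)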

Create results for Data Temporal Maps and Information Geometric Temporal plots

Estimate DataTemporalMaps and IGTProjections at a monthly period. We restrict the analysis to the source_x, license, journal and publish_time variables.

probMaps <- estimateDataTemporalMap( data           = data[,c('source_x','license','journal','publish_time')], 
                                     dateColumnName = "publish_time", 
                                     period         = "month")
igtProjs <- sapply(probMaps, estimateIGTProjection, dimensions = 3)

Add to the results the DTMs and IGT projections we previously calculated manually for the texts.

probMaps[["Title 1-gram"]] <- probMapTitleN1
probMaps[["Abstract 3-gram"]] <- probMapAbstractN3
igtProjs[["Title 1-gram"]] <- igtTitleN1
igtProjs[["Abstract 3-gram"]] <- igtAbstractN3

Plot, refine and export results

Display Data Temporal Heatmaps (DTHs) and IGT plots for the Title 1-gram, Abstract 3-gram and journal variables. Note that DTHs for categorical variables are by default sorted by frequency.

We first observe the full time span for the Title 1-gram, displaying only the 20 most frequent terms in the DTH.

plotDataTemporalMap(probMaps$`Title 1-gram`, endValue = 20)

Next, the IGT plot for the Title 1-gram shows a clear abrupt shift in the terms used at approximately 2001.

plotIGTProjection(igtProjs$`Title 1-gram`, dimensions = 3)

We can now zoom in on the DTH for a clearer look at the changing values.

plotDataTemporalMap(probMaps$`Title 1-gram`, startDate = "1990-01-01", endValue = 25)

Additionally, although possibly not significant here, if we zoom in again starting in 2004 we notice a higher frequency of the term 'chapter' in the last two months of each year, which was also visible in the previous IGT plot as a cloud of points.
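
A sketch of that second zoom follows; the startDate matches the text above, while endValue = 25 is an illustrative choice mirroring the earlier zoom:

# zoom into the Title 1-gram DTH from 2004 onwards (endValue is illustrative)
plotDataTemporalMap(probMaps$`Title 1-gram`, startDate = "2004-01-01", endValue = 25)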

Next we get the DTH for the Abstract 3-gram, which helps visualize the shifts in sequences of three consecutive words: the very high incidence of 'severe acute respiratory syndrome' at the beginning of the displayed period (which in fact started with the May 2003 SARS outbreak), but also the latest incidence of COVID-19 in values such as 'sars-cov-2' and, once again, in 'severe_acute_respiratory_syndrome' (note that the first two rows from the bottom could be understood as essentially one highly frequent 4-gram).

plotDataTemporalMap(probMaps$`Abstract 3-gram`, startDate = "2004-01-01", endValue = 51)

Then, we re-estimate the IGT projection over this zoomed period to avoid embedding noise caused by the earlier months. We can observe the abrupt shift noted before, a continuous trend throughout the period, and the outlying months of 2020.

igtProjAbstractN3 = estimateIGTProjection(probMaps$`Abstract 3-gram`, startDate = "2004-01-01", dimensions = 3)

plotIGTProjection(igtProjAbstractN3, dimensions = 2)

As a last note, we show the DTHs for the journal and license variables. Besides the changing trends in journals, another finding is the increase in bioRxiv publications in the last few years.

plotDataTemporalMap(probMaps$journal, startDate = "1990-01-01", endDate = "2020-03-27", endValue = 20)
plotDataTemporalMap(probMaps$license, startDate = "2004-01-01", endDate = "2020-03-27", endValue = 4)

We optionally save the results in an .RData file to be used in the Shiny app, either locally or through the online app.

save(probMaps, igtProjs, file="variability_cord-19_monthly_notebook.RData")
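
As an optional final check (not required by the Shiny app), the saved file can be reloaded to confirm that both result lists are intact:

# reload the saved results and list the available variables
load("variability_cord-19_monthly_notebook.RData")
names(probMaps)
names(igtProjs)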

Bibliography

1. Sáez C, Rodrigues PP, Gama J, Robles M, García-Gómez JM. Probabilistic change detection and visualization methods for the assessment of temporal stability in biomedical data quality. Data Mining and Knowledge Discovery. 2015;29:950–75.

2. Sáez C, Zurriaga O, Pérez-Panadés J, Melchor I, Robles M, García-Gómez JM. Applying probabilistic temporal and multisite data quality control methods to a public health mortality registry in Spain: A systematic approach to quality control of repositories. Journal of the American Medical Informatics Association. 2016;23:1085–95.

3. Sáez C, García-Gómez JM. Kinematics of big biomedical data to characterize temporal variability and seasonality of data repositories: Functional data analysis of data temporal evolution over non-parametric statistical manifolds. International Journal of Medical Informatics. 2018;119:109–24.

4. Sáez C, Gutiérrez-Sacristán A, Kohane I, García-Gómez JM, Avillach P. EHRtemporalVariability: Delineating temporal dataset shifts in electronic health records. Submitted.