6 Reproducibility

6.1 Working with articles

One of the useful skills in data science is taking a paper and using the code from that paper to get the exact same results with its dataset. This is called reproducibility.

One of the skills I have learned is to judge an article based on its reproducibility. To demonstrate this skill I grabbed a random article from PubMed; you can find the article here.

This article looks at the safety of some of the first COVID-19 vaccines.

6.1.1 Article References

Edward E. Walsh, M.D., Robert W. Frenck, Jr., M.D., Ann R. Falsey, M.D., Nicholas Kitchin, M.D., Judith Absalon, M.D., Alejandra Gurtman, M.D., Stephen Lockhart, D.M., Kathleen Neuzil, M.D., Mark J. Mulligan, M.D., Ruth Bailey, B.Sc., Kena A. Swanson, Ph.D., Ping Li, Ph.D., Kenneth Koury, Ph.D., Warren Kalina, Ph.D., David Cooper, Ph.D., Camila Fontes-Garfias, B.Sc., Pei-Yong Shi, Ph.D., Özlem Türeci, M.D., Kristin R. Tompkins, B.Sc., Kirsten E. Lyke, M.D., Vanessa Raabe, M.D., Philip R. Dormitzer, M.D., Kathrin U. Jansen, Ph.D., Uğur Şahin, M.D., and William C. Gruber, M.D.

6.1.2 Transparency criteria for the article

Transparency Criteria | Definition | Response Type | Response for This Article
--------------------- | ---------- | ------------- | -------------------------
Study Purpose | A concise statement in the introduction of the article, often in the last paragraph, that establishes the reason the research was conducted. Also called the study objective. | Binary | Yes
Data Availability Statement | A statement, in an individual section offset from the main body of text, that explains how or if one can access a study’s data. The title of the section may vary, but it must explicitly mention data; it is therefore distinct from a supplementary materials section. | Binary | Yes
Data Location | Where the article’s data can be accessed, either raw or processed. | Found Value | Email the author
Study Location | Author has stated in the methods section where the study took place or the data’s country/region of origin. | Binary; Found Value | No
Author Review | The professionalism of the contact information that the author has provided in the manuscript. | Found Value | Author contact information found
Ethics Statement | A statement within the manuscript indicating any ethical concerns, including the presence of sensitive data. | Binary | Yes
Funding Statement | A statement within the manuscript indicating whether or not the authors received funding for their research. | Binary | Yes
Code Availability | Authors have shared access to the most updated code that they used in their study, including code used for analysis. | Binary | No

6.1.3 Other data

The protocol, the ethics statement and the data sharing statement can be found beneath the article.

6.1.4 Extra notes about the article

Overall this is a well-reproducible article that tests the safety of multiple COVID-19 vaccine candidates in a placebo-controlled trial. One of the things that makes this article so reproducible is that it is COVID-19 related: after the first wave of COVID-19, scientists started openly sharing their data in order to develop the best vaccines as quickly as possible.

6.2 Working with code from articles

In data science it is also a useful skill to be able to work with open-science articles. These articles publish their code and show how they obtained their data. In this part of the page I look at the code of a random article and try to reproduce it. The article can be found here.

Looking at the code, I can see that the data analysts first wrote some functions for creating graphs. After this they load the required data and check it for missing values. After cleaning the data, they start making graphs.

One of the things I look at when analysing code is whether it has comments describing what is being done. This code has a comment above every code block, but the comments are not always as clear as they could be.

After analysing the code, I copied parts of it to see how easy it is to reproduce the article’s results.

library(here)
# read data ---------------------------------------------------------------
# sample 1
sample1 <- read.csv(here("data","sample1_use for revision 1.csv"))
sample1 <- sample1[sample1$include_1,]

# sample 2
sample2 <- read.csv(here("data","sample2_use for revision 1.csv"))
sample2 <- sample2[sample2$include_1,]

# combine samples for the additional analyses
data <- rbind(sample1, sample2)

# packages and helper functions --------------------------------------------
# helper functions (copied here from a separate file in the original repository)

# recreate ggplot2's default discrete colour palette for n groups
gg_color_hue <- function(n) {
  hues <- seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}

# for a given baseline date, return the dates on which a daily survey could
# have been completed: days 1-4, 9-12, 17-20 and 25-27 after baseline,
# capped at 2020-04-13 (presumably the end of data collection)
get_max_daily <- function(x){
  x <- x + c(1:4, 9:12, 17:20, 25:27)
  x <- x[x < ymd("2020-04-13 UTC")]
  return(x)
}

# install.packages("lubridate")
library(lubridate)

# -------------------------------------------------------------------------

## define colors (the two default ggplot2 hues)
## note: this assignment shadows base::c(), but calls to c() as a function still work
c <- gg_color_hue(2)
col.s1 <- c[1]
col.s2 <- c[2]

# completed baseline surveys per day that are included in the study
data_l2_s1 <- unique(sample1[c("ID", "b_baseline_ended")])
data_l2_s2 <- unique(sample2[c("ID", "b_baseline_ended")])

# no missings on this variable
any(is.na(data_l2_s1[2]))
## [1] FALSE
any(is.na(data_l2_s2[2]))
## [1] FALSE
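
# truncate the baseline timestamps to whole days by zeroing the time components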
date_s1 <- ymd_hms(data_l2_s1$b_baseline_ended)
hour(date_s1) <- 0
minute(date_s1) <- 0
second(date_s1) <- 0

date_s2 <- ymd_hms(data_l2_s2$b_baseline_ended)
hour(date_s2) <- 0
minute(date_s2) <- 0
second(date_s2) <- 0


# number of baseline surveys completed
t_date_baseline_1 <- table(ymd(date_s1))
t_date_baseline_2 <- table(ymd(date_s2))
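# sample 1 has no baseline surveys on the final date (2020-04-12); append an
# explicit zero so both tables align before combining them row-wise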
t_date_baseline_1 <- c(t_date_baseline_1, "2020-04-12" = 0)
t_date_baseline_12 <- rbind(t_date_baseline_1, t_date_baseline_2)

# barplot sample 1 and 2 baseline participation ---------------------------

layout(matrix(1:2, 1, 2, byrow = TRUE))  # two plot panels side by side

b <- barplot(t_date_baseline_12, beside = TRUE, ylim = c(0, 800), names.arg = rep("", length(t_date_baseline_12)),
             col = c(col.s1, col.s2), axes = FALSE)
box()
axis(2, las = 2, cex.axis = 0.8)
s <- seq(1, 28, 7)
labels <- colnames(t_date_baseline_12)[s]
axis(1, at = ((b[1,] + b[2,])/2)[s], labels = labels, cex.axis = 0.8)
text("A", x = ((b[1,] + b[2,])/2)[2], y = 800*.9, cex = 4, col = "grey")
legend(x = ((b[1,] + b[2,])/2)[15], y = 800, legend = c("Sample 1", "Sample 2"), fill = c(col.s1, col.s2), bty = "n")


# dates at which a daily survey could have been completed:
max_daily_dates_s1 <- lapply(ymd(date_s1), get_max_daily)
max_daily_dates_s2 <- lapply(ymd(date_s2), get_max_daily)

# flatten the per-participant lists of possible survey dates into one vector each
max_dates_s1 <- do.call("c", max_daily_dates_s1)
max_dates_s2 <- do.call("c", max_daily_dates_s2)
obtained_dates_s1 <- sample1$daily_date
obtained_dates_s2 <- sample2$daily_date

t_max_dates_s1 <- table(max_dates_s1)
t_max_dates_s2 <- table(max_dates_s2)

t_obtained_dates_s1 <- table(obtained_dates_s1)
t_obtained_dates_s2 <- table(obtained_dates_s2)

t_obtained_dates_s12 <- rbind(t_obtained_dates_s1, t_obtained_dates_s2)

b <- barplot(t_obtained_dates_s12, beside = TRUE, ylim = c(0, 2000),
             names.arg = rep("", length(t_obtained_dates_s12)),
             col = c(col.s1, col.s2), axes = FALSE)
box()
axis(2, las = 2, cex.axis = 0.8)
labels <- colnames(t_obtained_dates_s12)[s]
axis(1, at = ((b[1,] + b[2,])/2)[s], labels = labels, cex.axis = 0.8)

lines(y = t_max_dates_s1, x = b[1,], type = "p", pch = "-", col = col.s1, cex = 1.5)
lines(y = t_max_dates_s2, x = b[2,], type = "p", pch = "-", col = col.s2, cex = 1.5)
text("B", x = ((b[1,] + b[2,])/2)[2], y = 2000*.9, cex = 4, col = "grey")
legend(x = ((b[1,] + b[2,])/2)[15], y = 2000, legend = c("Sample 1", "Sample 2"), fill = c(col.s1, col.s2), bty = "n")

# difference obtained and maximum -----------------------------------------

obtained_total <- sum(t_obtained_dates_s12)
max_reports_total <- sum(c(t_max_dates_s1, t_max_dates_s2))
round(obtained_total/max_reports_total*100, 2)
## [1] 75.74
# average surveys per day -------------------------------------------------

median(colSums(t_obtained_dates_s12))  # the "average" reported here is the median across days
## [1] 1468
# average number of loneliness scores per participant ---------------------

s1_lst <- split(sample1, f = sample1$ID)
n_1 <- unlist(lapply(s1_lst, function(x) sum(!is.na(x$loneliness))))
mean(n_1)
## [1] 8.150741
sd(n_1)
## [1] 3.359446
range(n_1)
## [1]  0 15
s2_lst <- split(sample2, f = sample2$ID)
n_2 <- unlist(lapply(s2_lst, function(x) sum(!is.na(x$loneliness))))
mean(n_2)
## [1] 8.124586
sd(n_2)
## [1] 3.399153
range(n_2)
## [1]  0 15
# distribution of participants across number of daily surveys
layout(1)
b <- barplot(table(c(n_1, n_2)), main = "", ylab = "valid loneliness scores",
             xlab = "number of completed daily surveys",
             ylim = c(0, 900), lwd = 1.3)
box(lwd = 1.3)

# surveys, not valid measures
n_1 <- unlist(lapply(s1_lst, function(x) nrow(x)))
n_2 <- unlist(lapply(s2_lst, function(x) nrow(x)))
mean(n_1)
## [1] 8.269769
sd(n_1)
## [1] 3.318693
range(n_1)
## [1]  1 15
mean(n_2)
## [1] 8.25207
sd(n_2)
## [1] 3.359575
range(n_2)
## [1]  1 15
b <- barplot(table(c(n_1, n_2)), main = "", ylab = "N",
             xlab = "number of completed daily surveys",
             ylim = c(0, 1000), lwd = 1.3)
box(lwd = 1.3)

# N and measurement occasions ---------------------------------------------

sample1_lst <- split(sample1, f = sample1$ID)
sample2_lst <- split(sample2, f = sample2$ID)

length(sample1_lst)
## [1] 2428
length(sample2_lst)
## [1] 2416
length(sample1_lst) + length(sample2_lst)
## [1] 4844
nrow(sample1)
## [1] 20079
nrow(sample2)
## [1] 19937
nrow(sample1) + nrow(sample2)
## [1] 40016
# demographic variables ---------------------------------------------------

names <- names(sample1)
variables_level_2 <- grep(x = names, pattern = "ID|group|b_|var_|federal", value = TRUE)
data_l2_s1 <- unique(sample1[variables_level_2])
data_l2_s2 <- unique(sample2[variables_level_2])

## age
mean(data_l2_s1$b_demo_age_1)
## [1] 37.29462
sd(data_l2_s1$b_demo_age_1)
## [1] 14.32523
mean(data_l2_s2$b_demo_age_1)
## [1] 37.57201
sd(data_l2_s2$b_demo_age_1)
## [1] 14.24426
# generate level 2 data frame for all participants
data_l2 <- unique(data[c(grep(x = names(data), pattern = "ID|b_demo|b_work|b_corona", value = TRUE))])

mean(data_l2$b_demo_age_1)
## [1] 37.87923
sd(data_l2$b_demo_age_1)
## [1] 14.40773
barplot(table(data_l2$b_demo_age_1), main = "Distribution of age across the total sample", lwd = 1.3, ylim = c(0, 200))
box(lwd = 1.3)

## quartiles of age
quantile(data_l2$b_demo_age_1)
##   0%  25%  50%  75% 100% 
##   18   27   34   48   88

There were a few things I had to change in the code:

  • I had to change the path from which the data is read (see the sketch below this list)

  • Some functions were defined in another file; I copied them into the script. These functions were gg_color_hue and get_max_daily
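
For example, the original script read the data from a location that only exists on the analysts’ own machine. Below is a minimal sketch of the change I made (the absolute path shown is hypothetical, not the actual path from the repository), using here() so that the path resolves relative to my own project root:

# original: an absolute, machine-specific path (hypothetical example)
# sample1 <- read.csv("C:/Users/analyst/project/data/sample1_use for revision 1.csv")

# adjusted: here() builds the path relative to the project root
library(here)
sample1 <- read.csv(here("data", "sample1_use for revision 1.csv"))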

This article was quite easy to reproduce: the code only needed a few changes. On a reproducibility scale of one to five, I would rate it about a four. One issue with the output of the graphs is that they do not contain y-axis labels. This is easy to fix by passing ylab to barplot(); a minimal sketch for the first plot (the label text is my own assumption, since the article does not name the axis):
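
b <- barplot(t_date_baseline_12, beside = TRUE, ylim = c(0, 800),
             names.arg = rep("", length(t_date_baseline_12)),
             col = c(col.s1, col.s2), axes = FALSE,
             ylab = "completed baseline surveys per day")  # assumed label text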