5  Data summaries and visualizations in R

5.1 Data source

The National Health and Nutrition Examination Survey has been conducted since 1971 by the National Center for Health Statistics. The survey collects demographic, socioeconomic, dietary, and health data. The nhgh data available from the Hmisc package is a sample of responses from the survey. The data dictionary is here (link). We use this data to show how one might generate summary tables and visualizations in R.

Hmisc::getHdata(nhgh)

# Peek at the data
nhgh |> head()
    seqn    sex      age                 re        income tx dx    wt    ht
1  51624   male 34.16667 Non-Hispanic White [25000,35000)  0  0  87.4 164.7
3  51626   male 16.83333 Non-Hispanic Black [45000,55000)  0  0  72.3 181.3
5  51628 female 60.16667 Non-Hispanic Black [10000,15000)  1  1 116.8 166.0
6  51629   male 26.08333   Mexican American [25000,35000)  0  0  97.6 173.0
7  51630 female 49.66667 Non-Hispanic White [35000,45000)  0  0  86.7 168.4
10 51633   male 80.00000 Non-Hispanic White [15000,20000)  0  1  79.1 174.3
     bmi  leg arml armc waist  tri  sub  gh albumin bun  SCr
1  32.22 41.5 40.0 36.4 100.4 16.4 24.9 5.2     4.8   6 0.94
3  22.00 42.0 39.5 26.6  74.7 10.2 10.5 5.7     4.6   9 0.89
5  42.39 35.3 39.0 42.2 118.2 29.6 35.6 6.0     3.9  10 1.11
6  32.61 41.7 38.7 37.0 103.7 19.0 23.2 5.1     4.2   8 0.80
7  30.57 37.5 36.1 33.3 107.8 30.3 28.0 5.3     4.3  13 0.79
10 26.04 42.8 40.0 30.2  91.1  8.6 15.2 5.4     4.3  16 0.83

5.2 Summary tables

There are many, many table making packages in R. Here are a few options.

5.2.1 Binary Variables

There are a few binary variables in the dataset, including sex, tx (using insulin or diabetes medications), and dx (diagnosed with diabetes or pre-diabetes). A simple tabular summary of the data is the counts and proportions.

suppressPackageStartupMessages(require(table1))
table1(~ sex, data=nhgh)
Overall
(N=6795)
sex
male 3372 (49.6%)
female 3423 (50.4%)
require(table1)
table1(~ sex, data=nhgh)
Overall
(N=6795)
sex
male 3372 (49.6%)
female 3423 (50.4%)

We can add dx and tx directly to the table. Note the following:

  • The variable labels for dx and tx were automatically added to the table.
  • The command did not calculate the count and percentage for each group. Rather, it gave the output that is generated for a continuous varable.
table1(~ sex + dx + tx, data=nhgh)
Overall
(N=6795)
sex
male 3372 (49.6%)
female 3423 (50.4%)
Diagnosed with DM or Pre-DM
Mean (SD) 0.135 (0.341)
Median [Min, Max] 0 [0, 1.00]
On Insulin or Diabetes Meds
Mean (SD) 0.0918 (0.289)
Median [Min, Max] 0 [0, 1.00]

Where did the labels come from? Why did the command generate the output for a continuous variable?

The label of the variable is stored as meta data. If we peek at the structure of the dataframe, we see the following info:

str(nhgh[,c("dx","tx")])
'data.frame':   6795 obs. of  2 variables:
 $ dx: 'labelled' int  0 0 1 0 0 1 1 0 0 1 ...
  ..- attr(*, "label")= chr "Diagnosed with DM or Pre-DM"
 $ tx: 'labelled' int  0 0 1 0 0 0 1 0 0 1 ...
  ..- attr(*, "label")= chr "On Insulin or Diabetes Meds"

Notice that the label attribute is specified for both variables. This is the metadata that the table1 command is using when it creates the table. Suppose we wanted to tweak the label.

label(nhgh$tx) <- "Using insulin or diabetes medications"
table1(~ sex + dx + tx, data=nhgh)
Overall
(N=6795)
sex
male 3372 (49.6%)
female 3423 (50.4%)
Diagnosed with DM or Pre-DM
Mean (SD) 0.135 (0.341)
Median [Min, Max] 0 [0, 1.00]
Using insulin or diabetes medications
Mean (SD) 0.0918 (0.289)
Median [Min, Max] 0 [0, 1.00]

Also note that tx and dx are listed as int variables. For table1 to treat tx and dx as categorical, binary variables, we need to change the variable type from integer to factor. Here, we keep the original data in-tack and add tx_f and dx_f.

nhgh$dx_f <- factor(nhgh$dx, 0:1, c("No","Yes"))
label(nhgh$dx_f) <- label(nhgh$dx)
nhgh$tx_f <- factor(nhgh$tx, 0:1, c("No","Yes"))
label(nhgh$tx_f) <- label(nhgh$tx)

A quick peek at the data reveals the new variables.

head(nhgh)
    seqn    sex      age                 re        income tx dx    wt    ht
1  51624   male 34.16667 Non-Hispanic White [25000,35000)  0  0  87.4 164.7
3  51626   male 16.83333 Non-Hispanic Black [45000,55000)  0  0  72.3 181.3
5  51628 female 60.16667 Non-Hispanic Black [10000,15000)  1  1 116.8 166.0
6  51629   male 26.08333   Mexican American [25000,35000)  0  0  97.6 173.0
7  51630 female 49.66667 Non-Hispanic White [35000,45000)  0  0  86.7 168.4
10 51633   male 80.00000 Non-Hispanic White [15000,20000)  0  1  79.1 174.3
     bmi  leg arml armc waist  tri  sub  gh albumin bun  SCr dx_f tx_f
1  32.22 41.5 40.0 36.4 100.4 16.4 24.9 5.2     4.8   6 0.94   No   No
3  22.00 42.0 39.5 26.6  74.7 10.2 10.5 5.7     4.6   9 0.89   No   No
5  42.39 35.3 39.0 42.2 118.2 29.6 35.6 6.0     3.9  10 1.11  Yes  Yes
6  32.61 41.7 38.7 37.0 103.7 19.0 23.2 5.1     4.2   8 0.80   No   No
7  30.57 37.5 36.1 33.3 107.8 30.3 28.0 5.3     4.3  13 0.79   No   No
10 26.04 42.8 40.0 30.2  91.1  8.6 15.2 5.4     4.3  16 0.83  Yes   No

The table formula updated with the new variables generates the counts and percentages for each variable.

table1(~ sex + dx_f + tx_f, data=nhgh)
Overall
(N=6795)
sex
male 3372 (49.6%)
female 3423 (50.4%)
Diagnosed with DM or Pre-DM
No 5881 (86.5%)
Yes 914 (13.5%)
Using insulin or diabetes medications
No 6171 (90.8%)
Yes 624 (9.2%)

5.2.2 Categorical variables

Categorical variables can be added to the table in a similar way.

table1(~ sex + dx_f + tx_f + re + income, data=nhgh)
Overall
(N=6795)
sex
male 3372 (49.6%)
female 3423 (50.4%)
Diagnosed with DM or Pre-DM
No 5881 (86.5%)
Yes 914 (13.5%)
Using insulin or diabetes medications
No 6171 (90.8%)
Yes 624 (9.2%)
Race/Ethnicity
Mexican American 1366 (20.1%)
Other Hispanic 706 (10.4%)
Non-Hispanic White 3117 (45.9%)
Non-Hispanic Black 1217 (17.9%)
Other Race Including Multi-Racial 389 (5.7%)
Family Income
[0,5000) 235 (3.5%)
[5000,10000) 315 (4.6%)
[10000,15000) 531 (7.8%)
[15000,20000) 456 (6.7%)
[20000,25000) 563 (8.3%)
[25000,35000) 845 (12.4%)
[35000,45000) 610 (9.0%)
[45000,55000) 522 (7.7%)
[55000,65000) 376 (5.5%)
[65000,75000) 274 (4.0%)
> 20000 232 (3.4%)
< 20000 75 (1.1%)
[75000,100000) 564 (8.3%)
>= 100000 877 (12.9%)
Missing 320 (4.7%)

5.2.3 Continuous variables

Continuous variables will be summarized with mean and standard deviation by default.

table1(~ sex + dx_f + tx_f + re + age + wt + ht + bmi + leg, data=nhgh)
Overall
(N=6795)
sex
male 3372 (49.6%)
female 3423 (50.4%)
Diagnosed with DM or Pre-DM
No 5881 (86.5%)
Yes 914 (13.5%)
Using insulin or diabetes medications
No 6171 (90.8%)
Yes 624 (9.2%)
Race/Ethnicity
Mexican American 1366 (20.1%)
Other Hispanic 706 (10.4%)
Non-Hispanic White 3117 (45.9%)
Non-Hispanic Black 1217 (17.9%)
Other Race Including Multi-Racial 389 (5.7%)
Age (years)
Mean (SD) 44.3 (20.6)
Median [Min, Max] 43.8 [12.0, 80.0]
Weight (kg)
Mean (SD) 79.4 (21.9)
Median [Min, Max] 76.3 [28.0, 239]
Standing Height (cm)
Mean (SD) 167 (10.3)
Median [Min, Max] 167 [123, 203]
Body Mass Index (kg/m^2)
Mean (SD) 28.3 (6.95)
Median [Min, Max] 27.3 [13.2, 84.9]
Upper Leg Length (cm)
Mean (SD) 38.4 (3.88)
Median [Min, Max] 38.4 [20.4, 50.6]
Missing 231 (3.4%)

Notice that the units of measurements are added to the table. This information is also already in the dataframe as an attribute:

str(nhgh[,"tri"])
 'labelled' num [1:6795] 16.4 10.2 29.6 19 30.3 8.6 19.4 12.4 13.8 27 ...
 - attr(*, "label")= chr "Triceps Skinfold"
 - attr(*, "units")= chr "mm"

5.2.4 Stratification

Often, we need to stratify a summary table. For example, suppose we wanted to compare those diagnosed with diabetes vs those not diagnosed. We can use the | syntax in the table formula.

table1(~ sex + tx_f + age + wt | dx_f, data=nhgh)
No
(N=5881)
Yes
(N=914)
Overall
(N=6795)
sex
male 2927 (49.8%) 445 (48.7%) 3372 (49.6%)
female 2954 (50.2%) 469 (51.3%) 3423 (50.4%)
Using insulin or diabetes medications
No 5864 (99.7%) 307 (33.6%) 6171 (90.8%)
Yes 17 (0.3%) 607 (66.4%) 624 (9.2%)
Age (years)
Mean (SD) 41.8 (20.3) 60.0 (15.1) 44.3 (20.6)
Median [Min, Max] 40.3 [12.0, 80.0] 62.1 [12.6, 80.0] 43.8 [12.0, 80.0]
Weight (kg)
Mean (SD) 77.7 (21.1) 89.8 (24.4) 79.4 (21.9)
Median [Min, Max] 74.9 [28.0, 239] 86.6 [37.0, 231] 76.3 [28.0, 239]

Notice that the labels may need to be different or a header added to the table. The reader will not know what Yes and No refer to.

table1(~ sex + tx_f + age + wt | dx_f, data=nhgh, caption = label(nhgh$dx_f))
Diagnosed with DM or Pre-DM
No
(N=5881)
Yes
(N=914)
Overall
(N=6795)
sex
male 2927 (49.8%) 445 (48.7%) 3372 (49.6%)
female 2954 (50.2%) 469 (51.3%) 3423 (50.4%)
Using insulin or diabetes medications
No 5864 (99.7%) 307 (33.6%) 6171 (90.8%)
Yes 17 (0.3%) 607 (66.4%) 624 (9.2%)
Age (years)
Mean (SD) 41.8 (20.3) 60.0 (15.1) 44.3 (20.6)
Median [Min, Max] 40.3 [12.0, 80.0] 62.1 [12.6, 80.0] 43.8 [12.0, 80.0]
Weight (kg)
Mean (SD) 77.7 (21.1) 89.8 (24.4) 79.4 (21.9)
Median [Min, Max] 74.9 [28.0, 239] 86.6 [37.0, 231] 76.3 [28.0, 239]

5.3 Summary visualizations

When creating visualizations, the first step is to answer the following question:

What do I want my reader to glean from the visualization?

The answer to this question will dictate what and how you create the visualization. If the goal is a general summary, a simple expression of the raw data distribution is perfectly adequate. If the goal is comparison between groups, the summary measures to compare should be side-by-side.

5.3.1 General summary

5.3.1.1 Categorical

t1 <- table(nhgh$re) |> proportions()
names(t1)[5] <- "Other Race Including\nMulti-Racial"
x <- barplot(t1, ylim = c(0,max(t1)*1.1), xaxt = "n", main = "Barplot: Proportion of survey respondants by race and ethnicity")
axis(1, at = x, labels = names(t1), cex.axis = .7, tck = 0)
box()

See here for adjusting/rotating labels (link).

suppressPackageStartupMessages(library(ggplot2))
t2 <- as.data.frame(t1)
names(t2) <- c("re", "p")

# Create the bar plot
ggplot(t2, aes(x = re, y = p)) +
  geom_bar(stat = "identity") +
  labs(title = "Barplot: Proportion of survey respondants by race and ethnicity", x = "", y = "Proportion") +
  theme_minimal() +
  theme(panel.border = element_rect(colour = "black", fill = NA),
    panel.grid = element_blank())

5.3.1.2 Continuous

get_label <- function(x){
  uu <- ifelse(has.units(x), gsub("U",units(x),"(U)"), "")
  ll <- ifelse(has.label(x), label(x), "No label")
  paste(ll,uu)
}

hist(
    nhgh$bmi
  , breaks = 100
  , freq = FALSE
  , main = "Histogram: Distribution of BMI"
  , xlab = get_label(nhgh$bmi)
  , las = 1
)
box()

ggplot(nhgh, aes(x = bmi)) +
  geom_histogram(aes(y = after_stat(density)), bins = 100, color = "black") +
  labs(title = "Histogram: Distribution of BMI", x = get_label(nhgh$bmi), y = "Density") +
  theme_minimal() +
  theme(
    panel.border = element_rect(colour = "black", fill = NA),
    panel.grid = element_blank()
  )

plot(ecdf(nhgh$bmi), main = "ECDF: Distribution of BMI", xlab = get_label(nhgh$bmi))
box()

ggplot(nhgh, aes(x = bmi)) +
  stat_ecdf(geom = "step") +
  labs(title = "ECDF: Distribution of BMI", x = get_label(nhgh$bmi), y = "ECDF") +
  theme_minimal() +
  theme(
    panel.border = element_rect(colour = "black", fill = NA),
    panel.grid = element_blank()
  )

The box plot is a 5 number summary of the distribution. It is most often used when comparing distributions across groups.

boxplot(nhgh$bmi, main = "Boxplot: BMI", ylab = get_label(nhgh$bmi))

ggplot(nhgh, aes(y = bmi)) +
  geom_boxplot() +
  labs(title = "Boxplot: BMI", y = get_label(nhgh$bmi), x = NULL) + 
  theme_minimal() +
  theme(
      panel.border = element_rect(fill = NA)
    , panel.grid = element_blank()
    , axis.text.x = element_blank()
  )

5.3.2 Comparison across groups

5.3.2.1 Binary

It may be important to understand diagnosis diabetes rates across racial and ethnic groups. The quanity of interest is

\[ P(\text{dx = Yes} \,|\, \text{re}) \] Here is one way to calculate the conditional proportion with the dplyr package.

suppressPackageStartupMessages(require(dplyr))
t1 <- nhgh |> 
  group_by(re) |>
  summarise(p = mean(dx)) %>% 
  as.data.frame()
  
t1 <- t1 |> 
  mutate(re = re |> gsub(" ", "\n",x=_))
t1[5,1] <- "Other Race\nIncluding Multi-racial"
barplot(t1[,"p"],names.arg = t1[,"re"], main = "Barplot: Proportion diagnosed with diabetes", cex.names = .7)

ggplot(t1, aes(x = re, y = p)) +
  geom_bar(stat = "identity") +
  labs(title = "Barplot: Proportion Diagnosed with Diabetes", x = "", y = "Proportion") +
  theme_minimal() +
  theme(
    panel.border = element_rect(fill = NA),
    panel.grid = element_blank()
  )

5.3.2.2 Categorical

In this plot, the other conditional proportion is plotted.

\[ P(\text{re}\,|\,\text{dx}) \]

t1 <- table(nhgh$dx_f, nhgh$re)
m <- rowSums(t1)  
t2 <- t1/m
colnames(t2) <- colnames(t2) |> gsub(" ", "\n",x=_)
colnames(t2)[5] <- "Other Race\nIncluding Multi-racial"
cols  <- c(rgb(1,0,0,1/4),rgb(0,0,1,1/4))

barplot(t2, beside=TRUE, ylim = c(0,max(t2)*1.1), main = "Grouped bar plot: Race and ethnicity proportions by diagnosis status", col = cols, cex.names = .8)
box()
legend("topright", rownames(t2), pch = 15, col = cols, bty = "n")

t3 <- as.data.frame(t2)
custom_colors <- c("Yes" = rgb(1,0,0,1/4), "No" = rgb(0,0,1,1/4))


t3 |>
  setNames(c("Group","Category","Value")) |>
  ggplot(aes(x = Category, y = Value, fill = Group)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  scale_fill_manual(values = custom_colors) +
  labs(title="Grouped bar plot: Race and ethnicity proportions by diagnosis status", x = "Category", y = "Value") +
  theme_minimal() +
  theme(
    panel.border = element_rect(fill = NA),
    panel.grid = element_blank()
  )

5.3.2.3 Continuous

h1 <- nhgh %>% 
  select(ht,sex) %>% 
  split(.$sex)

rb <- c(rgb(0,0,1,1/4), rgb(1,0,0,1/4)) 

hist(
    h1[[1]]$ht
  , freq = FALSE
  , col = rb[1]
  , breaks = 50
  , main = "Overlayed histograms: Height by sex"
  , xlab = get_label(nhgh$ht)
  , xlim = range(nhgh$ht)
  , ylim = c(0,0.06)
  , border = "white"
)
hist(
    h1[[2]]$ht
  , add=TRUE
  , freq = FALSE
  , col = rb[2]
  , breaks = 50
  , border = "white"
)
box()
legend("topleft", legend = names(h1), pch = 15, col = rb, bty = "n")

ggplot(nhgh, aes(x = ht, fill = sex)) +
  geom_histogram(
      aes(y = after_stat(density))
    , bins = 100
    , position = "identity"
    , alpha = 0.25
  ) +
  scale_fill_manual(values = rb) +  
  labs(title = "Overlayed histograms: Height by sex", x = get_label(nhgh$ht), y = "Density") +
  theme_minimal() +
  theme(
    panel.border = element_rect(fill = NA),
    panel.grid = element_blank() 
  ) +
  guides(fill = guide_legend(title = "Sex"))

boxplot(ht ~ sex, data = nhgh, col = rb, ylab = get_label(nhgh$ht), main = "Boxplots: Height by sex")

ggplot(nhgh, aes(x = sex, y = ht, fill = sex)) +
  geom_boxplot() +
  scale_fill_manual(values = rb) +
  labs(y = get_label(nhgh$ht), x = "Sex", title = "Boxplots: Height by sex") +
  theme_minimal() +
  theme(
    panel.border = element_rect(fill = NA),
    panel.grid = element_blank()
  ) +
  guides(fill = "none")

5.3.3 Covariation

5.3.3.1 Scatter plot

arml_label <- get_label(nhgh$arml)
leg_label <- get_label(nhgh$leg)

plot(nhgh$arml, nhgh$leg, xlab = arml_label, ylab = leg_label, main = "Scatterplot")

Add transparency.

plot(nhgh$arml, nhgh$leg, xlab = arml_label, ylab = leg_label, pch = 16, col = "#00000010", main = "Scatterplot")

nhgh |> 
  filter(!is.na(arml)) |>
  filter(!is.na(leg)) |>
  ggplot(aes(x = arml, y = leg)) +
  geom_point() +
  labs(title = "Scatterplot", x = arml_label, y = leg_label) +
  theme_minimal() +
  theme(
    panel.border = element_rect(fill = NA),
    panel.grid = element_blank()
  )

Add transparency.

nhgh |> 
  filter(!is.na(arml)) |>
  filter(!is.na(leg)) |>
  ggplot(aes(x = arml, y = leg)) +
  geom_point(color = "#000000", alpha = 0.125, shape = 16) +
  labs(title = "Scatterplot", x = arml_label, y = leg_label) +
  theme_minimal() +
  theme(
    panel.border = element_rect(fill = NA),
    panel.grid = element_blank()
  )

5.4 Combo table and figures

sparkline::sparkline(0)
des <- Hmisc::describe(nhgh)
print(des, 'continuous')

nhgh Descriptives

15 Continous Variables of 22 Variables, 6795 Observations
Variable Label Units n Missing Distinct Info Mean Gini |Δ| Quantiles
.05 .10 .25 .50 .75 .90 .95
seqn Respondent sequence number 6795 0 6795 1.000 56872 3515 52125 52626 54246 56873 59511 61095 61627
age Age years 6795 0 813 1.000 44.29 23.74 14.33 16.75 25.67 43.75 61.33 74.08 80.00
wt Weight kg 6795 0 1022 1.000 79.37 23.77 49.97 54.60 64.00 76.30 91.10 107.10 119.00
ht Standing Height cm 6795 0 532 1.000 167 11.68 150.8 153.9 159.6 166.6 174.5 180.5 184.2
bmi Body Mass Index kg/m2 6795 0 2389 1.000 28.32 7.481 19.15 20.58 23.43 27.29 31.88 37.23 41.00
leg Upper Leg Length cm 6564 231 232 1.000 38.41 4.365 31.92 33.40 36.00 38.40 41.00 43.30 44.70
arml Upper Arm Length cm 6616 179 166 1.000 36.87 3.152 32.5 33.4 35.0 36.8 38.8 40.5 41.6
armc Arm Circumference cm 6607 188 320 1.000 32.49 5.899 24.50 26.00 28.85 32.10 35.60 39.30 41.70
waist Waist Circumference cm 6556 239 820 1.000 96.25 19.15 70.80 74.95 83.50 95.30 106.90 118.50 126.50
tri Triceps Skinfold mm 6314 481 354 1.000 18.79 9.497 7.0 8.4 12.0 17.9 25.0 31.0 33.8
sub Subscapular Skinfold mm 5824 971 343 1.000 19.96 9.587 7.7 9.1 13.0 19.4 26.2 31.8 35.0
gh Glycohemoglobin % 6795 0 99 0.995 5.677 0.799 4.8 5.0 5.2 5.5 5.8 6.4 7.2
albumin Albumin g/dL 6706 89 28 0.991 4.274 0.3635 3.7 3.9 4.1 4.3 4.5 4.7 4.8
bun Blood urea nitrogen mg/dL 6706 89 59 0.995 12.92 5.572 6 8 9 12 15 19 22
SCr Creatinine mg/dL 6706 89 205 1.000 0.8786 0.2907 0.55 0.60 0.70 0.83 0.98 1.14 1.28

5.5 Interactive visualizations

suppressPackageStartupMessages(require(Hmisc))
suppressPackageStartupMessages(require(plotly))
options(grType='plotly') 
plot(summaryM(ht + wt + SCr + bun ~ sex, data=nhgh))