5 Data summaries and visualizations in R

5.1 Data source

The National Health and Nutrition Examination Survey has been conducted since 1971 by the National Center for Health Statistics. The survey collects demographic, socioeconomic, dietary, and health data. The nhgh data available from the Hmisc package is a sample of responses from the survey. The data dictionary is here (link). We use this data to show how one might generate summary tables and visualizations in R.

Hmisc::getHdata(nhgh)

# Peek at the data
nhgh |> head()

    seqn    sex      age                 re        income tx dx    wt    ht
1  51624   male 34.16667 Non-Hispanic White [25000,35000)  0  0  87.4 164.7
3  51626   male 16.83333 Non-Hispanic Black [45000,55000)  0  0  72.3 181.3
5  51628 female 60.16667 Non-Hispanic Black [10000,15000)  1  1 116.8 166.0
6  51629   male 26.08333   Mexican American [25000,35000)  0  0  97.6 173.0
7  51630 female 49.66667 Non-Hispanic White [35000,45000)  0  0  86.7 168.4
10 51633   male 80.00000 Non-Hispanic White [15000,20000)  0  1  79.1 174.3
     bmi  leg arml armc waist  tri  sub  gh albumin bun  SCr
1  32.22 41.5 40.0 36.4 100.4 16.4 24.9 5.2     4.8   6 0.94
3  22.00 42.0 39.5 26.6  74.7 10.2 10.5 5.7     4.6   9 0.89
5  42.39 35.3 39.0 42.2 118.2 29.6 35.6 6.0     3.9  10 1.11
6  32.61 41.7 38.7 37.0 103.7 19.0 23.2 5.1     4.2   8 0.80
7  30.57 37.5 36.1 33.3 107.8 30.3 28.0 5.3     4.3  13 0.79
10 26.04 42.8 40.0 30.2  91.1  8.6 15.2 5.4     4.3  16 0.83

5.2 Summary tables

There are many, many table making packages in R. Here are a few options.

Table1 package (link)

5.2.1 Binary Variables

There are a few binary variables in the dataset, including sex, tx (using insulin or diabetes medications), and dx (diagnosed with diabetes or pre-diabetes). A simple tabular summary of the data is the counts and proportions.

suppressPackageStartupMessages(require(table1))
table1(~ sex, data=nhgh)

	Overall (N=6795)
sex
male	3372 (49.6%)
female	3423 (50.4%)

require(table1)
table1(~ sex, data=nhgh)

	Overall (N=6795)
sex
male	3372 (49.6%)
female	3423 (50.4%)

We can add dx and tx directly to the table. Note the following:

The variable labels for dx and tx were automatically added to the table.
The command did not calculate the count and percentage for each group. Rather, it gave the output that is generated for a continuous varable.

table1(~ sex + dx + tx, data=nhgh)

	Overall (N=6795)
sex
male	3372 (49.6%)
female	3423 (50.4%)
Diagnosed with DM or Pre-DM
Mean (SD)	0.135 (0.341)
Median [Min, Max]	0 [0, 1.00]
On Insulin or Diabetes Meds
Mean (SD)	0.0918 (0.289)
Median [Min, Max]	0 [0, 1.00]

Where did the labels come from? Why did the command generate the output for a continuous variable?

The label of the variable is stored as meta data. If we peek at the structure of the dataframe, we see the following info:

str(nhgh[,c("dx","tx")])

'data.frame':   6795 obs. of  2 variables:
 $ dx: 'labelled' int  0 0 1 0 0 1 1 0 0 1 ...
  ..- attr(*, "label")= chr "Diagnosed with DM or Pre-DM"
 $ tx: 'labelled' int  0 0 1 0 0 0 1 0 0 1 ...
  ..- attr(*, "label")= chr "On Insulin or Diabetes Meds"

Notice that the label attribute is specified for both variables. This is the metadata that the table1 command is using when it creates the table. Suppose we wanted to tweak the label.

label(nhgh$tx) <- "Using insulin or diabetes medications"
table1(~ sex + dx + tx, data=nhgh)

	Overall (N=6795)
sex
male	3372 (49.6%)
female	3423 (50.4%)
Diagnosed with DM or Pre-DM
Mean (SD)	0.135 (0.341)
Median [Min, Max]	0 [0, 1.00]
Using insulin or diabetes medications
Mean (SD)	0.0918 (0.289)
Median [Min, Max]	0 [0, 1.00]

Also note that tx and dx are listed as int variables. For table1 to treat tx and dx as categorical, binary variables, we need to change the variable type from integer to factor. Here, we keep the original data in-tack and add tx_f and dx_f.

nhgh$dx_f <- factor(nhgh$dx, 0:1, c("No","Yes"))
label(nhgh$dx_f) <- label(nhgh$dx)
nhgh$tx_f <- factor(nhgh$tx, 0:1, c("No","Yes"))
label(nhgh$tx_f) <- label(nhgh$tx)

A quick peek at the data reveals the new variables.

head(nhgh)

    seqn    sex      age                 re        income tx dx    wt    ht
1  51624   male 34.16667 Non-Hispanic White [25000,35000)  0  0  87.4 164.7
3  51626   male 16.83333 Non-Hispanic Black [45000,55000)  0  0  72.3 181.3
5  51628 female 60.16667 Non-Hispanic Black [10000,15000)  1  1 116.8 166.0
6  51629   male 26.08333   Mexican American [25000,35000)  0  0  97.6 173.0
7  51630 female 49.66667 Non-Hispanic White [35000,45000)  0  0  86.7 168.4
10 51633   male 80.00000 Non-Hispanic White [15000,20000)  0  1  79.1 174.3
     bmi  leg arml armc waist  tri  sub  gh albumin bun  SCr dx_f tx_f
1  32.22 41.5 40.0 36.4 100.4 16.4 24.9 5.2     4.8   6 0.94   No   No
3  22.00 42.0 39.5 26.6  74.7 10.2 10.5 5.7     4.6   9 0.89   No   No
5  42.39 35.3 39.0 42.2 118.2 29.6 35.6 6.0     3.9  10 1.11  Yes  Yes
6  32.61 41.7 38.7 37.0 103.7 19.0 23.2 5.1     4.2   8 0.80   No   No
7  30.57 37.5 36.1 33.3 107.8 30.3 28.0 5.3     4.3  13 0.79   No   No
10 26.04 42.8 40.0 30.2  91.1  8.6 15.2 5.4     4.3  16 0.83  Yes   No

The table formula updated with the new variables generates the counts and percentages for each variable.

table1(~ sex + dx_f + tx_f, data=nhgh)

	Overall (N=6795)
sex
male	3372 (49.6%)
female	3423 (50.4%)
Diagnosed with DM or Pre-DM
No	5881 (86.5%)
Yes	914 (13.5%)
Using insulin or diabetes medications
No	6171 (90.8%)
Yes	624 (9.2%)

5.2.2 Categorical variables

Categorical variables can be added to the table in a similar way.

table1(~ sex + dx_f + tx_f + re + income, data=nhgh)

	Overall (N=6795)
sex
male	3372 (49.6%)
female	3423 (50.4%)
Diagnosed with DM or Pre-DM
No	5881 (86.5%)
Yes	914 (13.5%)
Using insulin or diabetes medications
No	6171 (90.8%)
Yes	624 (9.2%)
Race/Ethnicity
Mexican American	1366 (20.1%)
Other Hispanic	706 (10.4%)
Non-Hispanic White	3117 (45.9%)
Non-Hispanic Black	1217 (17.9%)
Other Race Including Multi-Racial	389 (5.7%)
Family Income
[0,5000)	235 (3.5%)
[5000,10000)	315 (4.6%)
[10000,15000)	531 (7.8%)
[15000,20000)	456 (6.7%)
[20000,25000)	563 (8.3%)
[25000,35000)	845 (12.4%)
[35000,45000)	610 (9.0%)
[45000,55000)	522 (7.7%)
[55000,65000)	376 (5.5%)
[65000,75000)	274 (4.0%)
> 20000	232 (3.4%)
< 20000	75 (1.1%)
[75000,100000)	564 (8.3%)
>= 100000	877 (12.9%)
Missing	320 (4.7%)

5.2.3 Continuous variables

Continuous variables will be summarized with mean and standard deviation by default.

table1(~ sex + dx_f + tx_f + re + age + wt + ht + bmi + leg, data=nhgh)

	Overall (N=6795)
sex
male	3372 (49.6%)
female	3423 (50.4%)
Diagnosed with DM or Pre-DM
No	5881 (86.5%)
Yes	914 (13.5%)
Using insulin or diabetes medications
No	6171 (90.8%)
Yes	624 (9.2%)
Race/Ethnicity
Mexican American	1366 (20.1%)
Other Hispanic	706 (10.4%)
Non-Hispanic White	3117 (45.9%)
Non-Hispanic Black	1217 (17.9%)
Other Race Including Multi-Racial	389 (5.7%)
Age (years)
Mean (SD)	44.3 (20.6)
Median [Min, Max]	43.8 [12.0, 80.0]
Weight (kg)
Mean (SD)	79.4 (21.9)
Median [Min, Max]	76.3 [28.0, 239]
Standing Height (cm)
Mean (SD)	167 (10.3)
Median [Min, Max]	167 [123, 203]
Body Mass Index (kg/m^2)
Mean (SD)	28.3 (6.95)
Median [Min, Max]	27.3 [13.2, 84.9]
Upper Leg Length (cm)
Mean (SD)	38.4 (3.88)
Median [Min, Max]	38.4 [20.4, 50.6]
Missing	231 (3.4%)

Notice that the units of measurements are added to the table. This information is also already in the dataframe as an attribute:

str(nhgh[,"tri"])

 'labelled' num [1:6795] 16.4 10.2 29.6 19 30.3 8.6 19.4 12.4 13.8 27 ...
 - attr(*, "label")= chr "Triceps Skinfold"
 - attr(*, "units")= chr "mm"

5.2.4 Stratification

Often, we need to stratify a summary table. For example, suppose we wanted to compare those diagnosed with diabetes vs those not diagnosed. We can use the | syntax in the table formula.

table1(~ sex + tx_f + age + wt | dx_f, data=nhgh)

	No (N=5881)	Yes (N=914)	Overall (N=6795)
sex
male	2927 (49.8%)	445 (48.7%)	3372 (49.6%)
female	2954 (50.2%)	469 (51.3%)	3423 (50.4%)
Using insulin or diabetes medications
No	5864 (99.7%)	307 (33.6%)	6171 (90.8%)
Yes	17 (0.3%)	607 (66.4%)	624 (9.2%)
Age (years)
Mean (SD)	41.8 (20.3)	60.0 (15.1)	44.3 (20.6)
Median [Min, Max]	40.3 [12.0, 80.0]	62.1 [12.6, 80.0]	43.8 [12.0, 80.0]
Weight (kg)
Mean (SD)	77.7 (21.1)	89.8 (24.4)	79.4 (21.9)
Median [Min, Max]	74.9 [28.0, 239]	86.6 [37.0, 231]	76.3 [28.0, 239]

Notice that the labels may need to be different or a header added to the table. The reader will not know what Yes and No refer to.

table1(~ sex + tx_f + age + wt | dx_f, data=nhgh, caption = label(nhgh$dx_f))

Diagnosed with DM or Pre-DM
	No (N=5881)	Yes (N=914)	Overall (N=6795)
sex
male	2927 (49.8%)	445 (48.7%)	3372 (49.6%)
female	2954 (50.2%)	469 (51.3%)	3423 (50.4%)
Using insulin or diabetes medications
No	5864 (99.7%)	307 (33.6%)	6171 (90.8%)
Yes	17 (0.3%)	607 (66.4%)	624 (9.2%)
Age (years)
Mean (SD)	41.8 (20.3)	60.0 (15.1)	44.3 (20.6)
Median [Min, Max]	40.3 [12.0, 80.0]	62.1 [12.6, 80.0]	43.8 [12.0, 80.0]
Weight (kg)
Mean (SD)	77.7 (21.1)	89.8 (24.4)	79.4 (21.9)
Median [Min, Max]	74.9 [28.0, 239]	86.6 [37.0, 231]	76.3 [28.0, 239]

5.3 Summary visualizations

When creating visualizations, the first step is to answer the following question:

What do I want my reader to glean from the visualization?

The answer to this question will dictate what and how you create the visualization. If the goal is a general summary, a simple expression of the raw data distribution is perfectly adequate. If the goal is comparison between groups, the summary measures to compare should be side-by-side.

5.3.1 General summary

t1 <- table(nhgh$re) |> proportions()
names(t1)[5] <- "Other Race Including\nMulti-Racial"
x <- barplot(t1, ylim = c(0,max(t1)*1.1), xaxt = "n", main = "Barplot: Proportion of survey respondants by race and ethnicity")
axis(1, at = x, labels = names(t1), cex.axis = .7, tck = 0)
box()

See here for adjusting/rotating labels (link).

suppressPackageStartupMessages(library(ggplot2))
t2 <- as.data.frame(t1)
names(t2) <- c("re", "p")

# Create the bar plot
ggplot(t2, aes(x = re, y = p)) +
  geom_bar(stat = "identity") +
  labs(title = "Barplot: Proportion of survey respondants by race and ethnicity", x = "", y = "Proportion") +
  theme_minimal() +
  theme(panel.border = element_rect(colour = "black", fill = NA),
    panel.grid = element_blank())

5.3.1.2 Continuous

Base
ggplot

get_label <- function(x){
  uu <- ifelse(has.units(x), gsub("U",units(x),"(U)"), "")
  ll <- ifelse(has.label(x), label(x), "No label")
  paste(ll,uu)
}

hist(
    nhgh$bmi
  , breaks = 100
  , freq = FALSE
  , main = "Histogram: Distribution of BMI"
  , xlab = get_label(nhgh$bmi)
  , las = 1
)
box()

ggplot(nhgh, aes(x = bmi)) +
  geom_histogram(aes(y = after_stat(density)), bins = 100, color = "black") +
  labs(title = "Histogram: Distribution of BMI", x = get_label(nhgh$bmi), y = "Density") +
  theme_minimal() +
  theme(
    panel.border = element_rect(colour = "black", fill = NA),
    panel.grid = element_blank()
  )

Base
ggplot

plot(ecdf(nhgh$bmi), main = "ECDF: Distribution of BMI", xlab = get_label(nhgh$bmi))
box()

ggplot(nhgh, aes(x = bmi)) +
  stat_ecdf(geom = "step") +
  labs(title = "ECDF: Distribution of BMI", x = get_label(nhgh$bmi), y = "ECDF") +
  theme_minimal() +
  theme(
    panel.border = element_rect(colour = "black", fill = NA),
    panel.grid = element_blank()
  )

The box plot is a 5 number summary of the distribution. It is most often used when comparing distributions across groups.

Base
ggplot

boxplot(nhgh$bmi, main = "Boxplot: BMI", ylab = get_label(nhgh$bmi))

ggplot(nhgh, aes(y = bmi)) +
  geom_boxplot() +
  labs(title = "Boxplot: BMI", y = get_label(nhgh$bmi), x = NULL) + 
  theme_minimal() +
  theme(
      panel.border = element_rect(fill = NA)
    , panel.grid = element_blank()
    , axis.text.x = element_blank()
  )

5.3.2 Comparison across groups

5.3.2.1 Binary

It may be important to understand diagnosis diabetes rates across racial and ethnic groups. The quanity of interest is

\[ P(\text{dx = Yes} \,|\, \text{re}) \] Here is one way to calculate the conditional proportion with the dplyr package.

suppressPackageStartupMessages(require(dplyr))
t1 <- nhgh |> 
  group_by(re) |>
  summarise(p = mean(dx)) %>% 
  as.data.frame()
  
t1 <- t1 |> 
  mutate(re = re |> gsub(" ", "\n",x=_))
t1[5,1] <- "Other Race\nIncluding Multi-racial"

Base
ggplot

barplot(t1[,"p"],names.arg = t1[,"re"], main = "Barplot: Proportion diagnosed with diabetes", cex.names = .7)

ggplot(t1, aes(x = re, y = p)) +
  geom_bar(stat = "identity") +
  labs(title = "Barplot: Proportion Diagnosed with Diabetes", x = "", y = "Proportion") +
  theme_minimal() +
  theme(
    panel.border = element_rect(fill = NA),
    panel.grid = element_blank()
  )

5.3.2.2 Categorical

In this plot, the other conditional proportion is plotted.

\[ P(\text{re}\,|\,\text{dx}) \]

Base
ggplot

t1 <- table(nhgh$dx_f, nhgh$re)
m <- rowSums(t1)  
t2 <- t1/m
colnames(t2) <- colnames(t2) |> gsub(" ", "\n",x=_)
colnames(t2)[5] <- "Other Race\nIncluding Multi-racial"
cols  <- c(rgb(1,0,0,1/4),rgb(0,0,1,1/4))

barplot(t2, beside=TRUE, ylim = c(0,max(t2)*1.1), main = "Grouped bar plot: Race and ethnicity proportions by diagnosis status", col = cols, cex.names = .8)
box()
legend("topright", rownames(t2), pch = 15, col = cols, bty = "n")

t3 <- as.data.frame(t2)
custom_colors <- c("Yes" = rgb(1,0,0,1/4), "No" = rgb(0,0,1,1/4))


t3 |>
  setNames(c("Group","Category","Value")) |>
  ggplot(aes(x = Category, y = Value, fill = Group)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  scale_fill_manual(values = custom_colors) +
  labs(title="Grouped bar plot: Race and ethnicity proportions by diagnosis status", x = "Category", y = "Value") +
  theme_minimal() +
  theme(
    panel.border = element_rect(fill = NA),
    panel.grid = element_blank()
  )

5.3.2.3 Continuous

Base
ggplot

h1 <- nhgh %>% 
  select(ht,sex) %>% 
  split(.$sex)

rb <- c(rgb(0,0,1,1/4), rgb(1,0,0,1/4)) 

hist(
    h1[[1]]$ht
  , freq = FALSE
  , col = rb[1]
  , breaks = 50
  , main = "Overlayed histograms: Height by sex"
  , xlab = get_label(nhgh$ht)
  , xlim = range(nhgh$ht)
  , ylim = c(0,0.06)
  , border = "white"
)
hist(
    h1[[2]]$ht
  , add=TRUE
  , freq = FALSE
  , col = rb[2]
  , breaks = 50
  , border = "white"
)
box()
legend("topleft", legend = names(h1), pch = 15, col = rb, bty = "n")

ggplot(nhgh, aes(x = ht, fill = sex)) +
  geom_histogram(
      aes(y = after_stat(density))
    , bins = 100
    , position = "identity"
    , alpha = 0.25
  ) +
  scale_fill_manual(values = rb) +  
  labs(title = "Overlayed histograms: Height by sex", x = get_label(nhgh$ht), y = "Density") +
  theme_minimal() +
  theme(
    panel.border = element_rect(fill = NA),
    panel.grid = element_blank() 
  ) +
  guides(fill = guide_legend(title = "Sex"))

Base
ggplot

boxplot(ht ~ sex, data = nhgh, col = rb, ylab = get_label(nhgh$ht), main = "Boxplots: Height by sex")

ggplot(nhgh, aes(x = sex, y = ht, fill = sex)) +
  geom_boxplot() +
  scale_fill_manual(values = rb) +
  labs(y = get_label(nhgh$ht), x = "Sex", title = "Boxplots: Height by sex") +
  theme_minimal() +
  theme(
    panel.border = element_rect(fill = NA),
    panel.grid = element_blank()
  ) +
  guides(fill = "none")

5.3.3 Covariation

5.3.3.1 Scatter plot

Base
ggplot

arml_label <- get_label(nhgh$arml)
leg_label <- get_label(nhgh$leg)

plot(nhgh$arml, nhgh$leg, xlab = arml_label, ylab = leg_label, main = "Scatterplot")

Add transparency.

plot(nhgh$arml, nhgh$leg, xlab = arml_label, ylab = leg_label, pch = 16, col = "#00000010", main = "Scatterplot")

nhgh |> 
  filter(!is.na(arml)) |>
  filter(!is.na(leg)) |>
  ggplot(aes(x = arml, y = leg)) +
  geom_point() +
  labs(title = "Scatterplot", x = arml_label, y = leg_label) +
  theme_minimal() +
  theme(
    panel.border = element_rect(fill = NA),
    panel.grid = element_blank()
  )

Add transparency.

nhgh |> 
  filter(!is.na(arml)) |>
  filter(!is.na(leg)) |>
  ggplot(aes(x = arml, y = leg)) +
  geom_point(color = "#000000", alpha = 0.125, shape = 16) +
  labs(title = "Scatterplot", x = arml_label, y = leg_label) +
  theme_minimal() +
  theme(
    panel.border = element_rect(fill = NA),
    panel.grid = element_blank()
  )

5.4 Combo table and figures

sparkline::sparkline(0)

des <- Hmisc::describe(nhgh)
print(des, 'continuous')

Variable	Label	Units	n	Missing	Distinct	Info	Mean	Gini \|Δ\|	Quantiles .05 .10 .25 .50 .75 .90 .95
`nhgh` Descriptives
15 Continous Variables of 22 Variables, 6795 Observations
seqn	Respondent sequence number		6795	0	6795	1.000	56872	3515	52125 52626 54246 56873 59511 61095 61627
age	Age	years	6795	0	813	1.000	44.29	23.74	14.33 16.75 25.67 43.75 61.33 74.08 80.00
wt	Weight	kg	6795	0	1022	1.000	79.37	23.77	49.97 54.60 64.00 76.30 91.10 107.10 119.00
ht	Standing Height	cm	6795	0	532	1.000	167	11.68	150.8 153.9 159.6 166.6 174.5 180.5 184.2
bmi	Body Mass Index	kg/m²	6795	0	2389	1.000	28.32	7.481	19.15 20.58 23.43 27.29 31.88 37.23 41.00
leg	Upper Leg Length	cm	6564	231	232	1.000	38.41	4.365	31.92 33.40 36.00 38.40 41.00 43.30 44.70
arml	Upper Arm Length	cm	6616	179	166	1.000	36.87	3.152	32.5 33.4 35.0 36.8 38.8 40.5 41.6
armc	Arm Circumference	cm	6607	188	320	1.000	32.49	5.899	24.50 26.00 28.85 32.10 35.60 39.30 41.70
waist	Waist Circumference	cm	6556	239	820	1.000	96.25	19.15	70.80 74.95 83.50 95.30 106.90 118.50 126.50
tri	Triceps Skinfold	mm	6314	481	354	1.000	18.79	9.497	7.0 8.4 12.0 17.9 25.0 31.0 33.8
sub	Subscapular Skinfold	mm	5824	971	343	1.000	19.96	9.587	7.7 9.1 13.0 19.4 26.2 31.8 35.0
gh	Glycohemoglobin	%	6795	0	99	0.995	5.677	0.799	4.8 5.0 5.2 5.5 5.8 6.4 7.2
albumin	Albumin	g/dL	6706	89	28	0.991	4.274	0.3635	3.7 3.9 4.1 4.3 4.5 4.7 4.8
bun	Blood urea nitrogen	mg/dL	6706	89	59	0.995	12.92	5.572	6 8 9 12 15 19 22
SCr	Creatinine	mg/dL	6706	89	205	1.000	0.8786	0.2907	0.55 0.60 0.70 0.83 0.98 1.14 1.28

5.5 Interactive visualizations

suppressPackageStartupMessages(require(Hmisc))
suppressPackageStartupMessages(require(plotly))
options(grType='plotly') 
plot(summaryM(ht + wt + SCr + bun ~ sex, data=nhgh))