Behind the Boxplot
Aaryan s -
Hi all!
I finally reached the point I had been working toward: visualizing and analyzing the differences in reimbursement timelines between public and private insurance. While I expected some variability, the results were more striking than I anticipated—and they challenge some common assumptions about efficiency in the American healthcare system.
I began by organizing the data into two datasets: one for claims processed through private insurance and one for claims submitted to public insurance, specifically Arizona Complete Health under Medicaid. Each dataset included key columns like service type, date of service, total days until payment, claim amount, and payor. After ensuring the datasets had matching structures, I merged them into one master dataset in R, labeled each claim by insurance type, and used that combined data to generate boxplots and run statistical tests.
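For anyone curious what that merge step looks like in practice, here is a minimal sketch. It is written in Python rather than the R used for the actual analysis, and the column names, payor names, and values are all invented for illustration; none of them come from the real data.

```python
# Hypothetical column names and values, made up purely for illustration.
private_claims = [
    {"service_type": "office visit", "date_of_service": "2023-01-10",
     "days_to_payment": 45, "claim_amount": 120.0, "payor": "Private Plan A"},
    {"service_type": "lab work", "date_of_service": "2023-02-03",
     "days_to_payment": 210, "claim_amount": 80.0, "payor": "Private Plan B"},
]
public_claims = [
    {"service_type": "office visit", "date_of_service": "2023-01-12",
     "days_to_payment": 28, "claim_amount": 95.0,
     "payor": "Arizona Complete Health"},
]


def merge_claims(private, public):
    """Check that both datasets share the same columns, then stack and label them."""
    rows = private + public
    if not rows:
        return []
    columns = set(rows[0])
    for row in rows:
        if set(row) != columns:
            # Report the columns that are present in one row but not the other.
            raise ValueError(f"column mismatch: {sorted(set(row) ^ columns)}")
    labeled = [dict(row, insurance_type="private") for row in private]
    labeled += [dict(row, insurance_type="public") for row in public]
    return labeled


master = merge_claims(private_claims, public_claims)
```

The structure check comes first on purpose: stacking two tables whose columns do not line up tends to fail quietly and only show up later as odd plots.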
The boxplot comparing reimbursement times by insurance type immediately revealed key differences. Not only was the median reimbursement time for private insurance higher than for public insurance, but private claims also had a much wider interquartile range and a large number of outliers, some of which extended beyond 600 days. Public insurance, on the other hand, showed a more compressed distribution: faster at the median, and more consistent and predictable.
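The numbers a boxplot draws (median, quartiles, interquartile range, and outliers) can also be computed directly. The sketch below is in Python rather than R and assumes the common 1.5 × IQR rule for flagging outliers. Quartile conventions differ between tools, so the values may not match a given plotting function exactly, and the function name is my own.

```python
import statistics


def boxplot_summary(days):
    """Summarize a list of reimbursement times the way a boxplot would.

    Needs at least two values. Outliers are points beyond 1.5 * IQR from
    the quartiles (an assumed convention, not confirmed by the analysis).
    """
    q1, med, q3 = statistics.quantiles(days, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [d for d in days if d < lower_fence or d > upper_fence]
    return {"q1": q1, "median": med, "q3": q3, "iqr": iqr, "outliers": outliers}


# Invented example values, not real claims.
example = boxplot_summary([25, 30, 41, 45, 52, 60, 75, 90, 640])
```

Running this once per insurance type puts the medians, the IQRs, and the outlier counts side by side, which are the three things the boxplot comparison rests on.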
The histograms provide additional clarity to the patterns revealed in the boxplot. The private insurance histogram shows a sharp peak around 30–60 days but features a long, heavy tail stretching beyond 600 days. This highlights a significant number of extreme outliers — claims that took far longer than average to be reimbursed. In contrast, the public insurance histogram also skews right but is more compact, with most claims clustering below 100 days and fewer extreme delays. Together, these visualizations emphasize that while both systems are skewed, private insurance reimbursement times are not only slower on average but also far more erratic and prone to extended delays.
To test whether these differences were statistically significant, I used the Mann-Whitney U test (also known as the Wilcoxon rank-sum test), which does not assume normally distributed data and so is appropriate for skewed data like mine. The result was unambiguous: W = 102,946,508, p < 2.2e-16, indicating a highly significant difference in reimbursement times between private and public insurance claims. With a p-value this small, it is extremely unlikely that the observed difference is due to random variation alone.
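To show what the test actually computes, here is a rough sketch of the Mann-Whitney U statistic with a two-sided p-value from the normal approximation. It is in Python rather than R, uses average ranks for ties, and applies no continuity correction, so its p-values can differ slightly from what a statistics package reports. The function name is my own, and the U it returns for the first sample is the quantity R labels W.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p-value via normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    # Variance of U with the standard correction for ties.
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # every value is identical, so there is nothing to test
    z = (u - mean_u) / sqrt(var_u)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p
```

The normal approximation is only reasonable for fairly large samples; for small ones, an exact test from an established statistics package is the safer choice.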
This challenges the common narrative that private insurance is inherently more efficient because it’s driven by market forces. In fact, my data suggests that private insurance may introduce more administrative complexity and delay. Prior authorizations, claim rejections, and fragmented plan structures could be contributing to the longer wait times. Public insurance, although often stereotyped as slow and bureaucratic, displayed a steadier and more standardized pattern.
Next steps in my project include stratifying the data by procedure type and claim amount to see whether these trends hold across different services. I’ll also explore whether outliers have identifiable causes and consider trimming the dataset to analyze “clean” claims separately. For now, though, it’s clear that the type of insurance matters — not just for what gets covered, but for how long providers have to wait to get paid.
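The stratification step could start with something as simple as a median per service type and insurance type. This is only a sketch of one possible approach, in Python rather than R, and the field names are the same invented ones as before, not the real column names.

```python
from collections import defaultdict
from statistics import median


def median_days_by_stratum(claims):
    """Group claims by (service type, insurance type) and take the median wait."""
    groups = defaultdict(list)
    for claim in claims:
        key = (claim["service_type"], claim["insurance_type"])
        groups[key].append(claim["days_to_payment"])
    return {key: median(days) for key, days in groups.items()}


# Invented example rows, not real claims.
sample_claims = [
    {"service_type": "office visit", "insurance_type": "private", "days_to_payment": 40},
    {"service_type": "office visit", "insurance_type": "private", "days_to_payment": 60},
    {"service_type": "office visit", "insurance_type": "public", "days_to_payment": 28},
    {"service_type": "lab work", "insurance_type": "private", "days_to_payment": 210},
]
strata = median_days_by_stratum(sample_claims)
```

If the private median stays higher within each service type, the overall gap is less likely to be an artifact of the two groups simply billing for different kinds of services.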