šŸ„
-
Counties Analyzed
šŸ“ˆ
-
Overperforming
šŸ“‰
-
Underperforming
šŸŽÆ
-
Model Accuracy (R²)

Life Expectancy: Actual vs Predicted (map legend). The color scale runs from āˆ’8 years (worse than predicted) through "As Expected" to +8 years (better than predicted).

Green counties are healthier than their socioeconomic factors predict. Red counties perform worse than expected. Click any county for details.

Prediction Model

Type: Random Forest Regressor

Target: Life Expectancy

Accuracy (R²): shown in the interactive dashboard

The model predicts each county's life expectancy based on 9 socioeconomic and health factors. The map shows where reality differs from predictions.
We chose a Random Forest because it captures nonlinear relationships without requiring feature scaling; its predictions are in the target's own units, so deviations read directly in real-world terms (years of life expectancy).
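
To make the pipeline concrete, here is a minimal sketch of how predictions and deviations could be computed and pre-exported for the map. It assumes a pandas/scikit-learn workflow; the column names, file paths, and hyperparameters are illustrative placeholders, not the project's actual code.

```python
# Minimal sketch (illustrative column names and paths, not the project's code):
# train a Random Forest on county-level factors, compute each county's
# deviation (actual - predicted), and export the result for the map.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

counties = pd.read_csv("county_health_clean.csv")  # one row per county

features = [
    "median_income", "unemployment", "uninsured_rate",
    "smoking_rate", "obesity_rate", "college_degree_rate",
    "primary_care_rate", "air_pollution", "housing_problems",
]  # stand-ins for the nine socioeconomic and health factors

X, y = counties[features], counties["life_expectancy"]

model = RandomForestRegressor(n_estimators=300, random_state=42)
model.fit(X, y)

counties["predicted"] = model.predict(X)
counties["deviation"] = counties["life_expectancy"] - counties["predicted"]
print("R² (training fit):", r2_score(y, counties["predicted"]))
# In practice, report R² on a held-out split to avoid overstating accuracy.

# Pre-compute once in Python and export to JSON for client-side rendering.
counties[["fips", "life_expectancy", "predicted", "deviation"]].to_json(
    "predictions.json", orient="records"
)
```

Pre-computing these values keeps any model code out of the browser, which matches the deployment approach described in the Development Process section.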

Design Rationale & Development Process

1. Design Rationale

Context and Purpose

Economic inequality shapes life expectancy in measurable ways, but national averages hide local nuance. This project invites the public to explore which U.S. counties exceed or fall short of expectations and to investigate why. Designed for public health students, policy analysts, and curious citizens, this visualization transforms raw health data into a story about resilience and disparity.

Why a Deviation Map?

We chose to implement a residual/deviation map rather than a standard choropleth because it directly addresses our research question: "Which counties defy economic predictions?" A traditional map showing just life expectancy or income would simply reveal the well-known correlation between wealth and health. Our approach reveals something more interesting: the exceptions to this rule.

Visual Encoding Decisions

  • Diverging color scale (red-white-green): We use a perceptually uniform diverging scale to encode positive (green) and negative (red) deviations from predictions. White represents counties performing as expected. This encoding immediately draws attention to outliers while maintaining clarity for the majority of counties near the center (a minimal code sketch of this mapping follows this list).
  • Geographic map layout: Preserves spatial relationships essential for identifying regional patterns and enables users to locate their own communities.
  • Quantitative scale: Deviations are measured in years of life expectancy, making the stakes concrete and relatable.
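
As an illustration only (the project itself uses D3.js for rendering), a deviation in years can be mapped onto the RdYlGn diverging palette as sketched below, with the scale centered at zero and spanning ±8 years to match the legend; the function name is hypothetical.

```python
# Illustrative sketch, not the project's D3 code: map a deviation in years
# onto the RdYlGn diverging palette, centered at 0 and spanning -8 to +8 years.
import matplotlib
import matplotlib.colors as mcolors

cmap = matplotlib.colormaps["RdYlGn"]
norm = mcolors.TwoSlopeNorm(vmin=-8, vcenter=0, vmax=8)

def deviation_color(years: float) -> str:
    """Hex color for a county's deviation from its predicted life expectancy."""
    return mcolors.to_hex(cmap(norm(years)))

print(deviation_color(-5.0))  # red end: worse than predicted
print(deviation_color(0.0))   # neutral midpoint: performing as expected
print(deviation_color(5.0))   # green end: better than predicted
```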

Interaction Techniques

  • Details-on-demand (tooltips): Hovering reveals county name, actual vs predicted values, and deviation amount without cluttering the map.
  • Modal drill-down: Clicking opens a detailed view with all 12 metrics, allowing users to investigate why a county might be outperforming or underperforming.
  • Dynamic filtering: Users can isolate overperformers, underperformers, or expected performers to focus their exploration.
  • View switching: Toggle to a standard single-metric view for comparison and validation of our model-based approach.

Alternatives Considered

We evaluated four approaches (see checkpoint documentation):

  • Bivariate choropleth: Would show two metrics simultaneously but requires complex 2D color scales and doesn't directly leverage our ML model.
  • Linked multi-view: Would enable brushing across map + scatter plot but risks overwhelming users and adds development complexity.
  • Clustering/archetype map: Would reveal county types but loses granular quantitative information about deviation magnitude.

The deviation map was selected because it best balances analytical power, interpretability, and direct relevance to our research question.

Design Inspirations

Our approach draws from regression diagnostic visualizations (residual plots) but applies them to geographic data. Similar techniques appear in election forecasting ("over/underperforming polls") and real estate analysis ("above/below market rate"), but are rare in public health visualization.

Interpretation and Ethical Considerations

The labels ā€œoverperformingā€ and ā€œunderperformingā€ describe statistical deviation, not moral or cultural judgment. County-level data can mask disparities within counties; a region that appears ā€œhealthyā€ overall may still contain underserved communities. The visualization should therefore prompt inquiry rather than serve as a ranking. The color scale communicates direction clearly but simplifies complex realities, so socioeconomic and historical context remains essential for interpretation.

2. Development Process

Team Workflow

Team Member | Primary Responsibilities | Estimated Hours
Harsh Arya | Data cleaning, Random Forest model implementation, feature engineering | 15 hours
Gabrielle Despaigne | Exploratory analysis, color scale optimization, documentation, testing | 16 hours
Camila Paik | D3.js map implementation, TopoJSON integration, interaction handlers | 20 hours
Raghav Vasappanavara | UI/UX design, CSS styling, modal components, responsive layout | 16 hours

Total effort: ~67 person-hours over 2 weeks

Technical Challenges

  • Data processing (8 hours): The County Health Rankings Excel file required extensive cleaning—column names varied across years, percentage encoding was inconsistent (some 0-1, some 0-100), and ~15% of counties had missing data for at least one metric. We implemented median imputation for model training (a minimal cleaning sketch follows this list).
  • Model integration (5 hours): Experimentation with Random Forest model implementation and optimization for 3,159 counties. Pre-computed predictions in Python and exported to JSON for efficient client-side rendering.
  • Map rendering performance (6 hours): Rendering 3,159 county paths caused lag on hover interactions. Optimized by simplifying TopoJSON geometry and using CSS transforms instead of re-rendering on hover.
  • Color scale design (4 hours): Finding a diverging scale that was colorblind-accessible, perceptually uniform, AND intuitively mapped to "good/bad" required testing multiple ColorBrewer palettes. Settled on RdYlGn with adjusted endpoints.
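
The sketch below illustrates the kind of normalization and imputation described in the data-processing bullet above; the sheet, file, and column names are placeholders rather than the project's actual code.

```python
# Illustrative cleaning sketch (file, sheet, and column names are placeholders):
# normalize inconsistently encoded percentage columns and median-impute gaps.
import pandas as pd

raw = pd.read_excel("county_health_rankings.xlsx", sheet_name=0)

percent_cols = ["smoking_rate", "obesity_rate", "uninsured_rate"]
for col in percent_cols:
    # Some releases encode percentages on a 0-100 scale, others 0-1; unify to 0-1.
    if raw[col].max() > 1.5:
        raw[col] = raw[col] / 100.0

# Median imputation for counties missing at least one metric (~15% of rows).
feature_cols = percent_cols + ["median_income", "unemployment"]
raw[feature_cols] = raw[feature_cols].fillna(raw[feature_cols].median())

raw.to_csv("county_health_clean.csv", index=False)
```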

Tools & Technologies

  • Data processing: Python (pandas, scikit-learn, openpyxl)
  • Visualization: D3.js v7, TopoJSON
  • Frontend: Vanilla JavaScript (no frameworks), CSS Grid/Flexbox
  • Deployment: GitHub Pages

What Took the Most Time?

Surprisingly, data wrangling consumed nearly 30% of our time despite using a "clean" public dataset. The County Health Rankings data is comprehensive but not designed for direct machine learning use—it required significant preprocessing. The second largest time sink was interaction polish (tooltips, modals, smooth transitions), which took longer than the core map rendering.

Lessons Learned

  • Pre-compute expensive calculations (ML predictions) during data prep, not in-browser
  • Start with simplified geometry (TopoJSON compression) for large geodata
  • User testing revealed that our initial deviation thresholds (+/- 2 years) were too sensitive—adjusting to +/- 1 year made patterns clearer
  • Accessibility features (keyboard navigation, ARIA labels) should be built in from the start, not retrofitted

3. Future Enhancements

Given more time, we would add:

  • State-level aggregation view for mobile users (county-level too detailed on small screens)
  • Time-series animation showing how deviations change from 2020-2025
  • Exportable county comparison tool (select multiple counties, download PDF report)
  • Integration with Census data for demographic breakdowns within counties

4. Data Sources & References