Technical Report

Executive Summary

This project builds a custom dataset from the National Park Service (NPS) API by combining parks, alerts, and campground data into a single table. The goal is to explore how park amenities relate to operational complexity across NPS sites.

The final dataset contains 474 park units and includes engineered features such as the number of activities, alerts, and campgrounds per park. Analysis suggests a weak positive relationship between amenities and operational complexity, with alert counts used as the measure of complexity.


Motivation

The National Park Service manages a wide range of park units that vary significantly in size, infrastructure, and usage. Understanding how park features relate to operational demands can provide insight into how resources are distributed and managed.

Rather than using a pre-existing dataset, this project builds a dataset from scratch to allow full control over feature selection and data structure.


Data Collection

Data was collected from the National Park Service API using three endpoints:

  • Parks
  • Alerts
  • Campgrounds

Each endpoint provides different information about park units. Data was retrieved using Python and stored as raw CSV files.


Methodology

Data Acquisition

API requests were made using the requests library. Each endpoint returned JSON data, which was converted into pandas DataFrames.
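The retrieval step might look roughly like the sketch below. It is not the project's actual script: the function name fetch_endpoint, the page_size default, and the injectable get argument (included only so the function can be exercised without network access) are assumptions. The sketch relies on the NPS API's limit and start query parameters and on the data and total fields of its JSON response. It pages until every record is fetched, whereas the report's own retrieval capped alerts and campgrounds at 500 records.

```python
import pandas as pd
import requests

BASE_URL = "https://developer.nps.gov/api/v1"  # NPS API root


def fetch_endpoint(endpoint, api_key, page_size=50, get=requests.get):
    """Page through one NPS endpoint and return all records as a DataFrame.

    `get` defaults to requests.get; it is a parameter only so that a stub
    can be substituted in tests (an assumption of this sketch).
    """
    records = []
    start = 0
    while True:
        response = get(
            f"{BASE_URL}/{endpoint}",
            params={"api_key": api_key, "limit": page_size, "start": start},
            timeout=30,
        )
        response.raise_for_status()
        payload = response.json()
        batch = payload.get("data", [])
        records.extend(batch)
        start += len(batch)
        # The API reports the total record count, typically as a string.
        total = int(payload.get("total", 0))
        if not batch or start >= total:
            break
    # Flatten the list of JSON objects into a tabular structure.
    return pd.json_normalize(records)
```

The returned DataFrame can then be written out with DataFrame.to_csv to produce the raw CSV file for that endpoint.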

The parks table served as the base dataset; alerts and campgrounds were merged onto it by park code.
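One way to merge by park code is to count records per park and left-join the counts onto the parks table, so that parks with no matching records keep a count of zero. The sketch below assumes the key column is named parkCode, as in the API's JSON; the helper name add_count and the toy rows are hypothetical.

```python
import pandas as pd


def add_count(parks, records, name, key="parkCode"):
    """Left-join a per-park record count onto the parks table."""
    counts = records.groupby(key).size().rename(name).reset_index()
    merged = parks.merge(counts, on=key, how="left")
    # Parks absent from `records` get NaN after the join; treat that as zero.
    merged[name] = merged[name].fillna(0).astype(int)
    return merged


# Toy rows for illustration only, not taken from the project's data.
parks = pd.DataFrame({"parkCode": ["acad", "yell", "zion"]})
alerts = pd.DataFrame({"parkCode": ["acad", "acad", "zion"]})
with_alerts = add_count(parks, alerts, "num_alerts")
```

A left join keeps every park in the base table, which matters here because most parks have few or no alerts. If column names were standardized during cleaning, the renamed key can be passed through the key argument.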

Data Cleaning

Key cleaning steps included:

  • Removing missing values
  • Standardizing column names
  • Converting data types
  • Selecting relevant variables
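The four steps above can be sketched as a single function. The report does not state which columns were kept, which naming convention was used, or which types were converted, so the KEEP list, the snake_case convention, and the numeric conversion of latitude and longitude are all assumptions made for illustration.

```python
import re

import pandas as pd

# Hypothetical selection of columns; the report does not list the real one.
KEEP = ["park_code", "full_name", "description", "activities",
        "latitude", "longitude"]


def to_snake_case(name):
    """Convert a camelCase column name such as 'parkCode' to 'park_code'."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def clean_parks(raw):
    # Standardize column names.
    df = raw.rename(columns=to_snake_case)
    # Select relevant variables (only those actually present).
    df = df[[col for col in KEEP if col in df.columns]]
    # Remove rows missing the merge key.
    df = df.dropna(subset=["park_code"]).copy()
    # Convert coordinate strings to numbers; unparseable values become NaN.
    for col in ("latitude", "longitude"):
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")
    return df.reset_index(drop=True)
```

Dropping only the rows that lack a park code, rather than every row with any missing value, is a deliberate choice in this sketch: it keeps parks that are merely missing optional fields.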

Feature Engineering

New variables were created to support analysis:

  • description_length: length of park description
  • num_activities: number of listed activities
  • num_alerts: number of alerts per park
  • num_campgrounds: number of campgrounds per park
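The two features derived from the parks table itself might be computed as in the sketch below. It assumes description_length is measured in characters and that the activities column still holds the list of activity objects returned by the API; both are assumptions, as is the helper name add_text_features. The num_alerts and num_campgrounds features come from merging the other endpoints by park code and are not covered here.

```python
import pandas as pd


def add_text_features(parks):
    """Add description_length and num_activities to a copy of `parks`."""
    out = parks.copy()
    # Missing descriptions count as length zero.
    out["description_length"] = out["description"].fillna("").str.len()
    # Each cell is expected to be a list of activities; anything else is zero.
    out["num_activities"] = out["activities"].apply(
        lambda items: len(items) if isinstance(items, list) else 0
    )
    return out
```

If the parks data has been round-tripped through CSV, the activities column will contain strings rather than lists and would need to be parsed before counting.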

Analysis

The primary question explored is the relationship between:

  • Amenities (activities and campgrounds)
  • Operational complexity (alerts)

A scatterplot of activities versus alerts shows that most parks have low alert counts regardless of activity level. However, parks with more activities tend to have slightly higher alert counts.

A simple linear trend line suggests a weak positive relationship.
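A trend line of this kind can be fitted with ordinary least squares, and the Pearson correlation coefficient gives a rough sense of how weak or strong the relationship is. The sketch below uses NumPy; the function name linear_trend is hypothetical and the input values are toy numbers for illustration, not results from this project's dataset.

```python
import numpy as np


def linear_trend(x, y):
    """Return (slope, intercept, pearson_r) for a least-squares line y ~ x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    r = np.corrcoef(x, y)[0, 1]
    return float(slope), float(intercept), float(r)


# Toy values only: activity counts and alert counts for six made-up parks.
num_activities = [2, 5, 8, 12, 20, 30]
num_alerts = [0, 1, 0, 2, 1, 3]
slope, intercept, r = linear_trend(num_activities, num_alerts)
```

With the real data, num_activities and num_alerts would be the corresponding columns of the final dataset.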


Key Findings

  • Most parks have low numbers of alerts
  • There is only a weak positive relationship between amenities and alerts
  • Parks with more infrastructure tend to show slightly higher operational complexity

Limitations

  • The dataset is a snapshot in time and does not update automatically
  • Alerts and campgrounds retrievals were limited to 500 records, so per-park counts may be understated if the API holds more records than that
  • Park units vary widely in type and scale, which affects comparisons

Reproducibility

The full workflow can be reproduced using scripts in the src/ directory:

  • get_parks.py
  • get_alerts.py
  • get_campgrounds.py
  • clean_parks.py
  • merge_alerts.py
  • merge_campgrounds.py
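A small driver could run these scripts in sequence. The sketch below assumes the order listed above is the intended execution order and that each script runs without command-line arguments; neither is stated in the report. The function name run_pipeline and its run argument (a stand-in for subprocess.run that makes the function testable) are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Assumed execution order: fetch raw data, clean parks, then merge.
SCRIPTS = [
    "get_parks.py",
    "get_alerts.py",
    "get_campgrounds.py",
    "clean_parks.py",
    "merge_alerts.py",
    "merge_campgrounds.py",
]


def run_pipeline(src_dir="src", run=subprocess.run):
    """Run each script with the current interpreter, stopping on failure."""
    for script in SCRIPTS:
        path = Path(src_dir) / script
        # check=True raises CalledProcessError if a script exits non-zero.
        run([sys.executable, str(path)], check=True)
```

Calling run_pipeline() from the repository root would execute the six scripts one after another and stop at the first one that fails.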

The final dataset is stored at:

data/processed/parks_final.csv

Conclusion

This project demonstrates a full data science workflow, from data collection to deployment. It highlights how API data can be transformed into a structured dataset and used to explore meaningful relationships between real-world variables.