Technical Report
Executive Summary
This project builds a custom dataset from the National Park Service (NPS) API by combining park, alert, and campground data into a single table with one row per park unit. The goal is to explore how park amenities relate to operational complexity across NPS sites.
The final dataset contains 474 park units and includes engineered features such as the number of activities, alerts, and campgrounds per park. Analysis suggests a weak positive relationship between amenities (activities and campgrounds) and operational complexity (alerts).
Motivation
The National Park Service manages a wide range of park units that vary significantly in size, infrastructure, and usage. Understanding how park features relate to operational demands can provide insight into how resources are distributed and managed.
Rather than using a pre-existing dataset, this project builds a dataset from scratch to allow full control over feature selection and data structure.
Data Collection
Data was collected from the National Park Service API using three endpoints:
- Parks
- Alerts
- Campgrounds
Each endpoint provides different information about park units. Data was retrieved using Python and stored as raw CSV files.
Methodology
Data Acquisition
API requests were made using the requests library. Each endpoint returned JSON data, which was converted into pandas DataFrames.
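As a minimal sketch of this step, the snippet below requests one endpoint and converts the returned records into a DataFrame. The function names, the `limit` default, and the sample payload are illustrative assumptions rather than the project's actual code; the base URL and the `data` key reflect the public NPS API response format as understood here.

```python
import pandas as pd

BASE_URL = "https://developer.nps.gov/api/v1"  # NPS API root (assumed for this sketch)


def fetch_endpoint(endpoint: str, api_key: str, limit: int = 500) -> dict:
    """Request one endpoint (e.g. "parks") and return the decoded JSON payload."""
    import requests  # imported lazily so the parsing helper below works offline

    response = requests.get(
        f"{BASE_URL}/{endpoint}",
        params={"api_key": api_key, "limit": limit},
        timeout=30,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()


def payload_to_frame(payload: dict) -> pd.DataFrame:
    """Convert the payload's "data" list of records into a DataFrame."""
    return pd.DataFrame(payload.get("data", []))


# Offline illustration with a made-up payload of the same shape.
sample_payload = {
    "total": "2",
    "data": [
        {"parkCode": "aaaa", "fullName": "Example Park A"},
        {"parkCode": "bbbb", "fullName": "Example Park B"},
    ],
}
parks = payload_to_frame(sample_payload)
```

With a real key, a call such as `payload_to_frame(fetch_endpoint("parks", api_key))` would follow the same path, and the resulting frame could be written out with `to_csv`.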
The parks dataset served as the base dataset, while alerts and campgrounds were merged using park codes.
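One way to perform this kind of merge is to count records per park code first and then left-join the counts onto the parks table, so that parks without any alerts or campgrounds are kept with a count of zero. The sketch below uses made-up rows, and the `parkCode` column name and the `add_count` helper are assumptions for illustration.

```python
import pandas as pd

# Made-up rows standing in for the three raw tables.
parks = pd.DataFrame({"parkCode": ["aaaa", "bbbb", "cccc"]})
alerts = pd.DataFrame({"parkCode": ["aaaa", "aaaa", "cccc"],
                       "title": ["Road closed", "Trail work", "Fire ban"]})
campgrounds = pd.DataFrame({"parkCode": ["bbbb"], "name": ["Example Campground"]})


def add_count(base: pd.DataFrame, records: pd.DataFrame, name: str,
              key: str = "parkCode") -> pd.DataFrame:
    """Left-join a per-park record count onto the base table."""
    counts = records.groupby(key).size().rename(name).reset_index()
    merged = base.merge(counts, on=key, how="left")
    # Parks with no matching records get NaN from the left join; treat as 0.
    merged[name] = merged[name].fillna(0).astype(int)
    return merged


merged = add_count(parks, alerts, "num_alerts")
merged = add_count(merged, campgrounds, "num_campgrounds")
```

A left join keeps the row count of the base table unchanged, which makes it easy to verify that no park unit was dropped or duplicated during merging.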
Data Cleaning
Key cleaning steps included:
- Removing missing values
- Standardizing column names
- Converting data types
- Selecting relevant variables
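The cleaning steps listed above could look roughly like the following. This is a sketch under assumptions: the column names, the snake_case renaming scheme, and the choice to drop only rows that lack a park code are illustrative, since the report does not specify them.

```python
import pandas as pd

# Made-up raw rows; real column names may differ.
raw = pd.DataFrame({
    "parkCode": ["aaaa", "bbbb", None],
    "fullName": ["Example Park A", "Example Park B", "Example Park C"],
    "latitude": ["44.5", "not available", "36.1"],
    "url": ["https://example.org/a", "https://example.org/b", "https://example.org/c"],
})


def clean_parks(df: pd.DataFrame) -> pd.DataFrame:
    """Select, rename, convert, and filter columns of the raw parks table."""
    keep = ["parkCode", "fullName", "latitude"]           # select relevant variables
    out = df[keep].rename(columns={"parkCode": "park_code",
                                   "fullName": "full_name"})  # standardize names
    # Convert text to numbers; unparseable values become NaN instead of raising.
    out["latitude"] = pd.to_numeric(out["latitude"], errors="coerce")
    # Remove rows that are missing the merge key.
    return out.dropna(subset=["park_code"]).reset_index(drop=True)


clean = clean_parks(raw)
```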
Feature Engineering
New variables were created to support analysis:
- description_length: length of park description
- num_activities: number of listed activities
- num_alerts: number of alerts per park
- num_campgrounds: number of campgrounds per park
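The two text-derived features can be sketched as follows, assuming the parks table has a `description` string column and an `activities` column holding a list of records per park. Both column names, the sample rows, and the helper name are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; "activities" is assumed to hold a list of dicts per park.
parks = pd.DataFrame({
    "parkCode": ["aaaa", "bbbb"],
    "description": ["A short text.", None],
    "activities": [[{"name": "Hiking"}, {"name": "Camping"}], []],
})


def add_text_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add description_length and num_activities columns to a copy of df."""
    out = df.copy()
    # Missing descriptions count as length 0 rather than NaN.
    out["description_length"] = out["description"].fillna("").str.len()
    # Non-list values (e.g. NaN) are treated as "no activities listed".
    out["num_activities"] = out["activities"].apply(
        lambda items: len(items) if isinstance(items, list) else 0
    )
    return out


features = add_text_features(parks)
```

If the activities column is read back from a raw CSV file, it arrives as text and would need to be parsed into a list before counting.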
Analysis
The primary question explored is the relationship between:
- Amenities (activities and campgrounds)
- Operational complexity (alerts)
A scatterplot of activities versus alerts shows that most parks have low alert counts regardless of activity level. However, parks with more activities tend to have slightly higher alert counts.
A simple linear trend line suggests a weak positive relationship.
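A trend line of this kind can be fitted with a least-squares polynomial of degree one. The sketch below uses NumPy and made-up values, so its numbers say nothing about the actual dataset; the helper name is an assumption, and the report does not state which tool produced the original fit.

```python
import numpy as np


def fit_trend(x: np.ndarray, y: np.ndarray) -> tuple[float, float, float]:
    """Return slope, intercept, and Pearson correlation of y against x."""
    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares straight line
    r = np.corrcoef(x, y)[0, 1]                 # strength of the linear relationship
    return float(slope), float(intercept), float(r)


# Made-up values for illustration only.
num_activities = np.array([2, 5, 10, 20, 30, 40])
num_alerts = np.array([0, 1, 0, 2, 1, 3])

slope, intercept, r = fit_trend(num_activities, num_alerts)
```

Reporting the correlation coefficient alongside the slope helps quantify what "weak" means, since a positive slope alone does not indicate how tightly the points follow the line.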
Key Findings
- Most parks have low numbers of alerts
- There is only a weak relationship between amenities and alerts
- Parks with more infrastructure tend to show slightly higher operational complexity
Limitations
- The dataset is a snapshot in time and does not update automatically
- Alerts and campgrounds were each limited to 500 records, so counts for some parks may be incomplete
- Park units vary widely in type and scale, which affects comparisons
Reproducibility
The full workflow can be reproduced using scripts in the src/ directory:
- get_parks.py
- get_alerts.py
- get_campgrounds.py
- clean_parks.py
- merge_alerts.py
- merge_campgrounds.py
The final dataset is stored at:
data/processed/parks_final.csv
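The scripts could be chained with a small runner such as the one sketched below. The run order is assumed to match the order in which the scripts are listed, and the runner itself (including the `build_commands` and `run_pipeline` names) is hypothetical and not part of the project.

```python
import subprocess
import sys
from pathlib import Path

# Assumed run order: the order in which the scripts are listed in the report.
SCRIPTS = [
    "get_parks.py",
    "get_alerts.py",
    "get_campgrounds.py",
    "clean_parks.py",
    "merge_alerts.py",
    "merge_campgrounds.py",
]


def build_commands(src_dir: str = "src") -> list[list[str]]:
    """Build one command per script, using the current Python interpreter."""
    return [[sys.executable, str(Path(src_dir) / name)] for name in SCRIPTS]


def run_pipeline(src_dir: str = "src") -> None:
    """Run each script in order and stop at the first failure."""
    for command in build_commands(src_dir):
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` from the repository root would execute the six scripts in sequence; `check=True` raises an error as soon as one of them exits with a non-zero status.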
Conclusion
This project demonstrates a full data science workflow, from data collection to analysis. It highlights how API data can be transformed into a structured dataset and used to explore relationships between real-world variables.