Classifying Transportation POIs Using SafeGraph Patterns Data
Learn how to classify train stations, bus stops, and airports using visit and popularity metrics
Pranav Thaenraj’s paper serves as an introduction to the classification of POI categories using SafeGraph Patterns visit data. You’ll learn how you can use dwell time, visitor distance from home, popularity by hour of the day, popularity by day of the week, and more to classify transportation services.
To show how this can be done, they classify POIs into three categories: bus and other vehicle transit systems, rail transportation, and airport operations. The underlying POI data comes from bus stops, train stations, and airports.
First, Thaenraj covers how to set up dependencies for the project, and which libraries you’ll be using (Pandas, NumPy, Seaborn, Matplotlib, and PySpark). Then you’ll load the data, which will result in a table that displays your data and its attributes.
As you can see, a range of data is included, such as raw visit counts, raw visitor counts, visits by day, distance from home, median dwell time, and more!
Now that you’ve loaded the data, you’ll need to clean it so that later processing is faster and more efficient. Start by removing unnecessary columns, so you have only the data that you need for your analysis.
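As a rough illustration of the column-removal step, here is a minimal Pandas sketch. The table contents, the column names, and the list of columns to drop are all assumptions made for this example, not the paper's actual selection.

```python
import pandas as pd

# Toy stand-in for a loaded Patterns table; names and values are illustrative only.
patterns = pd.DataFrame({
    "placekey": ["a", "b"],
    "raw_visit_counts": [120, 45],
    "raw_visitor_counts": [80, 30],
    "median_dwell": [35.0, 6.0],
    "distance_from_home": [52000, 3100],
    "street_address": ["1 Airport Rd", "2 Main St"],
    "phone_number": [None, None],
})

# Hypothetical list of columns the analysis does not need.
unneeded = ["street_address", "phone_number"]

# errors="ignore" skips any listed column that is absent from the table.
features = patterns.drop(columns=unneeded, errors="ignore")
print(list(features.columns))
```

Dropping columns early keeps the working table small, which matters more as the number of rows grows.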
Beyond that, some data may need to be cleaned further. In this case, the bus stop NAICS category consists of “Bus Stop and Other Transit Services”, which means it doesn’t just contain POI data for bus stops, but also truck rentals, yacht services, and more. For Thaenraj’s purposes, every transportation service that is not a bus stop needs to be filtered out. Once you identify these records, you can remove them from your data.
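One possible way to do this kind of filtering is to match on the POI name. The sketch below assumes a `location_name` column and a hand-picked keyword list; both are hypothetical choices for illustration, and the paper may identify non-bus-stop records differently.

```python
import pandas as pd

# Toy rows standing in for the mixed transit category; names are invented.
transit = pd.DataFrame({
    "location_name": [
        "Main St Bus Stop",
        "Harbor Yacht Services",
        "U-Move Truck Rental",
        "Bus Station 5th Ave",
    ],
    "raw_visit_counts": [40, 12, 30, 55],
})

# Hypothetical keywords that mark records which are not bus stops.
exclude_keywords = ["yacht", "truck rental"]
pattern = "|".join(exclude_keywords)

# Case-insensitive match on the name, then keep only the rows that do not match.
is_excluded = transit["location_name"].str.contains(pattern, case=False, regex=True)
bus_stops = transit[~is_excluded].reset_index(drop=True)
print(bus_stops["location_name"].tolist())
```

A keyword filter like this is easy to audit, but it will miss records whose names don't mention the service, so it is worth inspecting what remains.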
Since Thaenraj needed three unique classes for this study, a multiclass classifier was required. To ensure accuracy in their analysis, Thaenraj used three different classification methods: Gaussian Naive Bayes, Decision Trees, and the K-Nearest Neighbors Classifier. For each method, you can run the classifier and then visualize the results on a heatmap for faster analysis.
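To make the comparison concrete, here is a minimal scikit-learn sketch that fits the three classifier types on a synthetic three-class dataset. Using scikit-learn, the synthetic data, the train/test split, and the default hyperparameters are all assumptions for this example; it does not reproduce the paper's features or results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the cleaned Patterns features.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        # Rows are true classes, columns are predicted classes.
        "confusion": confusion_matrix(y_test, predictions, labels=[0, 1, 2]),
    }
    print(f"{name}: accuracy = {results[name]['accuracy']:.2f}")
```

Each confusion matrix in `results` can then be passed to a plotting call such as `seaborn.heatmap(matrix, annot=True)` to get a heatmap view of where a model confuses one class for another.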
After running these functions, you can see that the Gaussian Naive Bayes model was not very successful at classifying this dataset, with an accuracy of about 26%. In contrast, the Decision Tree model had an accuracy of 75%. The final method used, K-Nearest Neighbors, had an average accuracy of approximately 68%, though it did a much better job of classifying airports (nearly 94%). As you can see, the method you use plays an important role in the success you’ll find classifying datasets.
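The gap between an average accuracy and a single class's accuracy can be read directly off a confusion matrix. The short NumPy sketch below shows the arithmetic; the counts are invented for illustration and are not the paper's results.

```python
import numpy as np

labels = ["bus", "rail", "airport"]

# Made-up counts: rows are true classes, columns are predicted classes.
confusion = np.array([
    [40, 6, 4],
    [10, 30, 10],
    [2, 3, 45],
])

# Per-class accuracy: correct predictions divided by the true count for that class.
per_class = confusion.diagonal() / confusion.sum(axis=1)

# Overall accuracy: all correct predictions divided by all samples.
overall = confusion.trace() / confusion.sum()

for label, value in zip(labels, per_class):
    print(f"{label}: {value:.2f}")
print(f"overall: {overall:.2f}")
```

A model can look mediocre overall while still being strong on one class, which is why it helps to report both numbers.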
Check out the original paper to read the results for yourself: Train, Bus, or Plane? Predictive Classification of Points of Interest Using Visit and Popularity Metrics
Pranav Thaenraj; Community Data Scientist at SafeGraph
You can also connect with other data scientists and researchers by joining the SafeGraph Community. There, you can share ideas, gain inspiration, and help each other grow!