Our first solo project for the Metis data science bootcamp would be a prediction model: linear regression. I looked around my room and saw my collection of Funko Pop! figures. Even though I had tried not to fall down the rabbit hole of collecting, my collection had grown drastically over the years beyond my initial focus on the Seattle Seahawks, Green Lantern, and Scarlet Witch. I already knew of the Pop Price Guide (PPG) from searching for figures in the past. So I had a topic that interested me and a source to get some features. I thought it was too niche a topic and might not be taken seriously, so I continued searching for alternatives. But as I searched, I kept thinking about the potential of a Funko Pop! project. I enjoyed the topic and was therefore more likely to enjoy the project. I was still reeling from the start of the bootcamp, and the sooner I decided on a project, the more work I could get done. Funko was a company I admired, and who knows, maybe they would see this project someday. As the desire to do the project grew, I took the leap and planned my next steps.
Who could even use my predictor? As with many products, I personally fall into the category of potential users. I don’t have unlimited funds to buy large numbers of figures, nor could I compete to get the exclusive ones at conventions. Then there’s the issue of where to put them all. I’m currently at the point of just putting my entire collection into a box. My shelves are overwhelmed, and there are some figures that don’t see the light of day anyway. It’s costly to store large collections, so my predictor could be useful for making better buying decisions. But it’s not just buying decisions: being able to predict prices might also inform selling decisions, such as whether to hold onto a figure for a rainy day.
One of our first lectures was on web scraping. I wanted to practice this skill and focused first on PPG. I had wanted to see if there were additional data sources I could use, but after seeing that Funko’s mobile app also drew its price data from PPG, I settled on PPG alone. After determining that the pages were loaded dynamically, I decided that, of the scraping tools we had learned, I would have to use Selenium. I then planned out how to scrape what I needed. After much trial and error, and some false starts, I ran my scraper overnight so as not to burden the site or interfere with lectures. PPG has since switched to scrolling to load more results, but at the time of my project, results were listed ten figures to a page. Starting from the page filtered to only “Vinyl Pops”, I had my scraper go through each page and grab the link for each figure and, for figures with variants, dig into the page listing the variants and grab those links too. Afterwards, I took the list of links and pulled the values for the various features I was after.
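The link-gathering pass described above can be sketched roughly as follows. This is a minimal illustration and not my original scraper: the function name `collect_links`, the `link_selector` argument, and the list of page URLs are all hypothetical, and the real pages would most likely also need an explicit wait for the dynamic content to load before looking up elements.

```python
import time

# Selenium's By.CSS_SELECTOR constant is the plain string "css selector",
# so it is used directly here to keep the sketch free of third-party imports.
CSS_SELECTOR = "css selector"


def collect_links(driver, page_urls, link_selector, delay_s=2.0):
    """Visit each listing page and return the unique hrefs that match.

    `driver` is any object with Selenium-style `get` and `find_elements`
    methods; `link_selector` is a hypothetical CSS selector for figure links.
    """
    links = []
    seen = set()
    for url in page_urls:
        driver.get(url)
        for element in driver.find_elements(CSS_SELECTOR, link_selector):
            href = element.get_attribute("href")
            if href and href not in seen:
                seen.add(href)
                links.append(href)
        time.sleep(delay_s)  # pause between pages to go easy on the site
    return links
```

The same function could be run a second time over the variant listing pages to pick up the variant links, with the pause between requests keeping the load on the site low.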
During scraping, I tried to pull only the specific feature values I had chosen. Each page listed which series the figure belonged to (e.g., TV Shows, NFL, Funko’s mascot Freddy Funko). In addition, the pages listed details such as the dimensions of the figure, properties such as whether it was glow-in-the-dark, and sometimes production numbers.
I was still relatively new to Python and Pandas, so data cleaning took a lot of my time as I learned by making mistakes until I got done what I wanted. For loops abounded, but results were more important than efficiency at this point in my journey. I came back to cleaning over and over as I found hiccups along the way, such as years not stored in a date format, more missing values turning up, or the need to flatten my category list of lists. By the end of my project, I knew I would have to come back and continue cleaning the data.
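To give a flavor of those hiccups, here is a small sketch of the three cleaning steps in plain Python (for loops and all). The rows are made up for illustration and are not real PPG data, and the field names and the helpers `parse_year` and `flatten` are my own placeholders rather than the code from the project.

```python
from datetime import date
from itertools import chain

# Made-up rows for illustration only (not real PPG data).
rows = [
    {"name": "Figure A", "year": "2015", "categories": [["TV Shows"], ["Glow"]]},
    {"name": "Figure B", "year": None, "categories": [["NFL"]]},
    {"name": "Figure C", "year": " 2018 ", "categories": [["TV Shows", "Exclusive"]]},
]


def parse_year(raw):
    """Turn a year string into a date (January 1), or None if missing or invalid."""
    if raw is None:
        return None
    try:
        return date(int(str(raw).strip()), 1, 1)
    except ValueError:
        return None


def flatten(nested):
    """Flatten a list of lists into a single list."""
    return list(chain.from_iterable(nested))


for row in rows:
    row["year"] = parse_year(row["year"])
    row["categories"] = flatten(row["categories"])

# Binarize: one 0/1 column per category seen anywhere in the data.
all_categories = sorted(set(chain.from_iterable(r["categories"] for r in rows)))
for row in rows:
    for category in all_categories:
        row[f"is_{category}"] = int(category in row["categories"])
```

Missing years are left as `None` here so they stay visible as missing values instead of being silently filled in.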
With a single model to focus on, linear regression, the biggest hurdle I faced was figuring out which features to use. This was where my difficulties understanding and coding what I wanted to do came into play. Not yet able to properly use scikit-learn’s search tools such as GridSearchCV, I manually added and removed features as I ran cross-validation, trying out my own polynomial and interaction terms. Using LASSO regularization to give me some ideas, I was able to arrive at my final model.
Our instructor had said that real-world models may sometimes have R² scores of 0.6. My training scores started in the high 0.2s, and the model I ended up with scored in the mid-0.4s. I ran it on the test set and got 0.22. Well, that meant my model could explain only so much of the variance. It should have made sense: the majority of my features were categorical features that had been binarized. How could that really tell you a price? After I slept on it, I was certain I would have to revisit this project. Time-series data and analysis added a level of difficulty that I had initially avoided, and it was a topic we had yet to cover. The price of a figure a month ago, three months ago, or even a year ago would serve as a perfect starting point. It was what was missing to help distinguish a $10 figure from a $13,000 figure. I knew that once I had the time, I could improve my model’s predictions, and I look forward to doing so.
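For anyone wondering what the score actually measures, it is the coefficient of determination: one minus the ratio of the model's squared error to the squared error of always guessing the mean price. A small sketch, with made-up prices and a helper name (`r_squared`) of my own choosing:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up prices for illustration only.
prices = [10.0, 12.0, 15.0, 40.0]
mean_price = sum(prices) / len(prices)

perfect = r_squared(prices, prices)                      # predictions match exactly
baseline = r_squared(prices, [mean_price] * len(prices))  # always guess the mean

print(perfect, baseline)
```

A score of 0.22 on the test set therefore means the model did only somewhat better than guessing the average price for every figure, and the drop from the training score suggests it had partly fit noise in the training data.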