Statistical Modeling: The Long Road to Confident Precision

Caleb Elgut
5 min read · Jul 29, 2020

Modeling is an overwhelming topic to learn about in data science, especially if you are relatively new to the subject. It is a process that takes a great deal of time, effort, and, most importantly, patience. It is a process during which you experience a great deal of error and inaccuracy while trying to understand what causes something to occur. This something, often referred to as a dependent variable, can vary greatly. It can be relatively innocuous and easy to understand, like the price or, perhaps, the mpg of a car; however, it can also be incredibly serious, such as the presence of a virus raging around the globe.

As a student working on a recent simulation (my second project at The Flatiron School), I had the incredible opportunity to create, and then recreate a dozen or more times, statistical models to predict the prices of houses in King County, Washington. I worked with a data set of over 22,000 homes sold in King County between 2014 and 2015. The data contained a series of variables (often referred to as “features”), and it was my job to understand how to use these variables to accurately predict my target variable, price. I will touch on some of the technical details here, but for more information, you can visit my GitHub repository and read through the readme or examine the notebooks: https://github.com/Kaleguts/real-estate-analysis

This project took an incredible amount of time and effort! I began with a baseline model created using a method called ordinary least squares regression (often referred to simply as “linear regression”). This method minimizes the sum of squared differences between the observed and predicted values. I won’t go deeper than the definition in explaining the math but, suffice it to say, this baseline model wouldn’t give me much initially, as I had not done any preprocessing on my data. The data was raw and new. It would need to be adjusted to account for its skew as well as its lack of normality, and I needed to account for variables measured on wildly different scales (the number of bedrooms and the square footage of a lot, for example, are hardly comparable).
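To make that concrete, here is a minimal sketch of what such a baseline might look like in Python with statsmodels. The file name and column names are illustrative assumptions on my part here, not the exact features from my notebooks:

```python
# A minimal baseline sketch; "kc_house_data.csv" and these column
# names are assumptions standing in for the project's real data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("kc_house_data.csv")

# Price and lot size are heavily right-skewed; log transforms pull
# them toward the normality that OLS inference assumes.
df["log_price"] = np.log(df["price"])
df["log_sqft_lot"] = np.log(df["sqft_lot"])

# Standardize predictors so bedrooms (single digits) and living area
# (thousands of square feet) sit on comparable scales.
features = ["bedrooms", "bathrooms", "sqft_living", "log_sqft_lot"]
X = (df[features] - df[features].mean()) / df[features].std()
X = sm.add_constant(X)  # OLS needs an explicit intercept column

model = sm.OLS(df["log_price"], X).fit()
print(model.summary())  # R-squared, coefficients, p-values
```

Fitting on log price rather than raw price is one common way to tame the skew described above; nothing about this sketch is unique to my project.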

The baseline model was the model upon which I built the rest of my models. I will touch on the specifics of these models in a moment, but I want you, reader, to understand the importance of this aspect of data science as it extends to many areas of life — some of which may impact you directly. Yes, perhaps, when you think of data science, you think of folks working for boards of corporations, aiding them in their quest for better ROI or, maybe, giving a marketing department information on which campaign best suits their current goals. However, data science is also a significant element of public health, medicine in general, and space technology! A classmate of mine mentioned that his friend, who works in the data science department for a very reputable firm, has been working on a mini-helicopter that will, one day, fly to Mars.

Here is an early version of my voyage into modeling:

After some preprocessing and normalization of the data, I attained the above model (this was my third version!). The black dashed line is what my model predicts will happen, and the blue dots are the real data. At this point in my work, my model was very much still in progress. Ideally, the blue dots would gather around, or even line up with, my line. Furthermore, my margin of error was incredibly high: when this model predicted a price, it would typically be off by about $200,000. No client would want to work with this!
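For the curious, here is a hedged sketch of how that kind of error figure and plot can be produced, assuming arrays y_true and y_pred from some fitted model; the $200,000 figure above is the sort of number this RMSE would report:

```python
# Sketch of an error metric and a predicted-vs-actual plot; y_true
# and y_pred are assumed to come from a model like the one above.
import numpy as np
import matplotlib.pyplot as plt

def rmse(y_true, y_pred):
    """Root mean squared error: the typical size of a prediction miss."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def plot_fit(y_true, y_pred):
    """Scatter real prices against predictions, with a perfect-fit line."""
    plt.scatter(y_true, y_pred, alpha=0.3, label="homes")
    lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
    plt.plot(lims, lims, "k--", label="perfect prediction")
    plt.xlabel("Actual price")
    plt.ylabel("Predicted price")
    plt.legend()
    plt.show()
```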

After a great deal of time and work, returning to the drawing board again and again, I eventually came up with a significantly more accurate model than any of my previous ones! I added and dropped features until the accuracy improved, and I even created new features myself. For example, I used latitude and longitude to break King County up into sectors, and I used zipcode’s relationship with price to create two columns: one flagging the ten highest-priced zipcodes and another flagging the ten lowest-priced ones.
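A rough sketch of those two engineered features might look like the following; the five-by-five grid of sectors is an assumption here, since the exact binning lives in the notebooks:

```python
# Sketch of the engineered features described above; the 5x5 sector
# grid is an assumed choice, and the cutoff of ten zipcodes comes
# from the text. "kc_house_data.csv" is a stand-in file name.
import pandas as pd

df = pd.read_csv("kc_house_data.csv")

# Break King County into a coarse grid of sectors from lat/long.
df["lat_bin"] = pd.cut(df["lat"], bins=5, labels=False)
df["long_bin"] = pd.cut(df["long"], bins=5, labels=False)
df["sector"] = df["lat_bin"] * 5 + df["long_bin"]

# Flag the ten highest- and ten lowest-priced zipcodes, ranked by
# each zipcode's median sale price.
zip_medians = df.groupby("zipcode")["price"].median().sort_values()
df["top10_zip"] = df["zipcode"].isin(zip_medians.tail(10).index).astype(int)
df["bottom10_zip"] = df["zipcode"].isin(zip_medians.head(10).index).astype(int)
```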

My final model for the project stands below:

This is much better! You can see how the blue dots gather around the dashed line, which reflects a far more accurate model. It came as a result of a great deal of feature engineering, accounting for multicollinearity (predictors so strongly correlated with one another that their individual effects get muddled), further normalization, and some old-fashioned trial and error.
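One standard way to check for that multicollinearity is the variance inflation factor (VIF). Here is a sketch with an assumed feature list; a VIF above roughly 5 to 10 is the conventional signal to drop or combine features:

```python
# VIF sketch; the feature list and file name are assumptions. A high
# VIF means a feature is largely predictable from the others.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("kc_house_data.csv")
features = ["bedrooms", "bathrooms", "sqft_living", "sqft_above"]
X = sm.add_constant(df[features])  # constant keeps the VIFs well-defined

vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=features,
)
print(vifs)  # sqft_living and sqft_above would likely flag each other
```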

In many situations, data science extends to the private sector, where clear deadlines are given months in advance. Many models can be built and rebuilt until you have an incredibly high level of confidence with a low margin of error. Other times, particularly in the world of Public Health, one does not have the luxury of distant deadlines. The models one comes up with are still the result of a team of highly qualified individuals, but if these scientists are dealing with a disease that is spreading rapidly through an entire nation, they will, perhaps, have a good but not great model. This model will give them the answer they need at the moment; however, as time progresses, these scientists do not just sit, proud of the model they created in month one of a disease’s spread. They continue to work and refine their models, especially as new information adds greater context to their data! Recommendations may change, but, generally, it is due to incoming information. That’s science, baby! It is a process of hypothesis testing and retesting.

When not hurried, a statistical model is continuously refined behind closed doors until it is relatively pristine and ready for presentation! In these moments, scientists may appear to be magicians, but nothing could be further from the truth. Any scientific innovation in our society, be it the color TV, IMAX, or the electric car, comes from years of testing and retesting until the product is ready for release. When a group of the world’s smartest scientists is thrown into action, they immediately set to work finding solutions to the crisis. Often this includes determining what will come next and which procedures and recommendations best respond to the crisis as it stands now!

As the months pass, the recommendations may change, but that is because the situation changes and more data becomes available. I understand that science’s very nature may make it seem untrustworthy if you are not familiar with it. Hypothesis testing and new conclusions leave many wondering why the previous solution has now been adjusted. Increased information often leads to change, but the change is aimed at the public good, particularly in the world of public health and the data analysis that comes from it. Those who create the models for prediction adjust their data so that predictions become more accurate and reliable and can better inform a direction for solutions.

Trust your doctors and data scientists. They are continually working to make this world a better place for you.
