The State of Solar: A Bright Future

Ira Evangelista

By: Ira Evangelista (re-posted with his permission)
September 27th, 2019 · 5 min read

Exploratory Analysis: the news is good.

Aside from three out of the nineteen years observed in the data that I used for this project, there has consistently been an increase in the installation of solar energy systems across the United States.

Figure 1: Year, number of solar installations, and year-over-year growth.

At first glance, it looks like the best years of growth were the late 1990s and early 2000s, but that’s only because the number of installations in the 25 observed states started from such a low baseline.
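To make the low-baseline effect concrete, here is a minimal sketch in pandas. The counts are invented for illustration and are not taken from the Tracking the Sun data; the column names are also my own.

```python
import pandas as pd

# Hypothetical yearly install counts (made up for illustration only).
counts = pd.DataFrame(
    {
        "year": [1998, 1999, 2000, 2001, 2002],
        "installs": [100, 300, 600, 900, 1200],
    }
).set_index("year")

# Absolute change versus percent change from the previous year.
counts["abs_change"] = counts["installs"].diff()
counts["yoy_pct"] = counts["installs"].pct_change() * 100

print(counts)
```

In this toy series the absolute growth settles at 300 installs per year, yet the year-over-year percentage keeps shrinking because each year is compared against a larger base.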

When we measure the number of solar systems installed on a yearly basis, we get a better picture of how much adoption has occurred — and how rapidly.

While solar energy by itself will not be the solution to ensuring that the United States and the world transition to sustainable energy, it is a big piece of the puzzle.

Diving into the data itself was so much fun, and I wish I had more time. Below are some of the key takeaways.

Figure 2: Percent change in the number of new solar installs in the United States compared to the year before (year-over-year)
Figure 3: Number of solar installations in the United States by year
Figure 4: Total quantity of solar installations per state in the United States from 1998-2017

Predictive Modeling.

For our prediction model, we chose “System Size” as our target. The question was whether the model we chose to implement could beat our baseline. As the graph in Figure 5 below shows, the sizes of the systems installed vary widely from state to state.

An interesting observation is that Arizona has one of the highest numbers of installed systems, as well as the largest average kilowatt-peak size per installation.

Figure 5: Average size of solar installations per state
Figure 6: Average size of new solar installations, according to customer segment. Residential installations were, on average, the smallest installs, whereas commercial installations were the largest.
(Note the NaN (-9999) values and the 1e7 in the bottom right corner, both of which I wish I had had extra time to finish cleaning before my project was due)
Figure 7: The sum of the new solar installation sizes, according to customer segment.
(Note the NaN (-9999) values and the 1e7 in the bottom right corner, both of which I wish I had had extra time to finish cleaning before my project was due)
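For anyone picking this up later, the -9999 placeholders could be cleaned before plotting with something like the sketch below. The column names and rows are hypothetical, and I am assuming the placeholder shows up both as a number and as the string "-9999".

```python
import pandas as pd

# Hypothetical rows; -9999 stands in for "missing" (assumed for both columns).
df = pd.DataFrame(
    {
        "customer_segment": ["RES", "COM", "-9999", "RES", "COM"],
        "system_size": [5.0, 120.0, 8.0, -9999.0, 80.0],
    }
)

clean = df.copy()
# mask() turns every matching placeholder into a real missing value (NaN).
clean["system_size"] = clean["system_size"].mask(clean["system_size"] == -9999)
clean["customer_segment"] = clean["customer_segment"].mask(
    clean["customer_segment"] == "-9999"
)
# Drop rows where either field is missing so they never reach the plot.
clean = clean.dropna(subset=["customer_segment", "system_size"])

avg_size = clean.groupby("customer_segment")["system_size"].mean()
print(avg_size)
```

With the placeholders gone, the bogus -9999 category disappears from the per-segment averages and sums.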

The Method.

Making this project proved harder than I thought, but I learned a tremendous amount throughout the process. I’m glad I got the chance to stretch my legs after seven weeks of class lectures and apply what we’ve learned through Units 1 and 2.

For this project, I wanted to see whether we could create a model that would predict the size of solar panel installations. System sizes are measured in kilowatt peak, a numeric value, and can come in any size depending on the needs of the residence, commercial building, school, etc. Because the target is continuous, I selected a Linear Regression model.
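As a rough sketch of what "beating the baseline" means here: the baseline predicts the training mean for every row, and the model has to get a lower mean absolute error (MAE) than that. The data below is synthetic and the setup is my own assumption, not the notebook's actual code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: three numeric features and a numeric target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 1.0, size=500)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Baseline: always guess the mean of the training target.
baseline_pred = np.full_like(y_val, y_train.mean())
baseline_mae = mean_absolute_error(y_val, baseline_pred)

# Linear regression fitted on the training split, scored on validation.
model = LinearRegression().fit(X_train, y_train)
model_mae = mean_absolute_error(y_val, model.predict(X_val))

print(f"baseline MAE: {baseline_mae:.2f}, model MAE: {model_mae:.2f}")
```

The comparison only makes sense when both MAE values are computed on the same held-out rows.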

At nearly 1.2 million rows and 63 features, the data set is huge.

The first thing I did was run the model while keeping all of the features. The results are listed below. I’ve included a heatmap from that first minimum viable product (MVP) model to give a picture of how many features there were and to highlight how much multicollinearity existed among the dataset’s features.

Baseline(mean): 22.22
MAE for Baseline Prediction: 23.19
MAE for y_pred validation: 104.65
MAE for y_pred test: 81.04
MSE on baseline: 393,955.61
MSE on y_pred val: 404,288.31
Figure 8: Don’t worry, the first time I saw a heat-map I was also overwhelmed. It simply shows the different r-values (or, correlations) between any two given aspects of data.

After removing features that had high p-values in my OLS regression, and using domain expertise to choose which of the highly multicollinear features on the heatmap to drop, I ended up with a little under one-fourth of the features I started with (fourteen out of sixty-three).
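I picked the features to drop by hand, but the same idea can be approximated automatically. The sketch below is not what I ran; it is a simple threshold rule on the correlation matrix, with made-up feature names and a cutoff of 0.9 chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic features: "a_scaled" is almost a copy of "a"; "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
features = pd.DataFrame(
    {
        "a": a,
        "a_scaled": 3.0 * a + rng.normal(scale=0.01, size=200),
        "b": b,
    }
)

# Absolute correlations, keeping only the upper triangle so that each
# pair of features is looked at once.
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later feature of any pair whose correlation exceeds the cutoff.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = features.drop(columns=to_drop)

print(to_drop)
```

A rule like this has no notion of which feature is more meaningful, which is why domain knowledge still matters when choosing between two correlated columns.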

Figure 9: After feature engineering, and removing features with high multicollinearity
Figure 10: OLS Regression, helpful in identifying features to remove and keep for our model. See highlighted boxes. Note the R² score: I ran an r2_score() function after that to get a second opinion (the 0.298 score below).

In the end, after five iterations of removing unnecessary features and changing my encoding method from ordinal to one-hot encoding, I was able to beat my MAE baseline score (from 23.19 to 16.94 on my y_pred test).
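To show the difference between the two encodings, here is a small sketch using plain pandas instead of the encoder classes from my notebook. The segment labels are illustrative.

```python
import pandas as pd

# Illustrative customer segments.
segments = pd.Series(["RES", "COM", "SCHOOL", "RES"], name="segment")

# Ordinal-style encoding: each category becomes one integer code.
# The codes imply an order (COM < RES < SCHOOL) that has no real meaning.
ordinal = segments.astype("category").cat.codes

# One-hot encoding: one 0/1 column per category, with no implied order.
one_hot = pd.get_dummies(segments, prefix="segment", dtype=int)

print(ordinal.tolist())
print(one_hot)
```

A linear model treats the integer codes as a quantity, so an arbitrary ordering can distort the fit. One-hot columns avoid that at the cost of more features.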

The final score that I was able to massage the model into giving me was:

Baseline (mean): 22.22
MAE for the Baseline Prediction: 23.19
MAE for the y_pred Val: 59.05
MAE for the y_pred Test: 16.94
MSE on baseline: 393,955.61
MSE on y_pred val: 393,267.74
R^2: 0.298


I’d love to understand better why my y-prediction validation score didn’t improve the same way my y-prediction test did when I made the changes I mentioned above.

The dataset has so many features that I’d love to dive deeper into, given more time and more experience with data wrangling: government incentives like rebates, tax credits, feed-in tariffs (FIT), and performance-based incentives.

I’d also love to build an animated choropleth that shows the year-over-year growth in solar system installations. I made an attempt, but due to the size of the dataset, I couldn’t accomplish what I wanted. And when I tried to solve the problem by aggregating with a groupby, I didn’t quite get the results I wanted on the map.
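One way the aggregation could look is sketched below: collapse the row-per-installation table into one row per state and year, which is much smaller than the raw data. The column names and rows are hypothetical.

```python
import pandas as pd

# Hypothetical installation-level rows (one row per installed system).
installs = pd.DataFrame(
    {
        "state": ["AZ", "AZ", "CA", "CA", "CA", "AZ"],
        "year": [2016, 2016, 2016, 2017, 2017, 2017],
        "system_size": [6.1, 7.4, 5.0, 4.8, 9.9, 6.5],
    }
)

# Count installations per (state, year) and return a flat, long-format table.
per_state_year = (
    installs.groupby(["state", "year"])
    .size()
    .reset_index(name="n_installs")
)

print(per_state_year)
```

A long table with one row per state and year is the shape that animated map functions generally expect, for example plotly express's choropleth with its animation_frame argument, though I have not verified that this would have fixed my particular map.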

Interesting note: my OLS regression (Figure 10) gave me an R² of 0.000. So I did a separate r2_score() test to get a “second opinion” on that reading. Interestingly enough, as I made optimizations to my model, the R² score improved just like my MAE scores did.
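For reference, R² compares the model's squared error with the squared error of always predicting the mean. The sketch below computes it by hand on toy numbers (not my data); scikit-learn's r2_score() computes the same quantity.

```python
import numpy as np

# Toy values, chosen so the arithmetic is easy to follow.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

# Residual sum of squares: how far the predictions are from the truth.
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: how far the truth is from its own mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1.0 - ss_res / ss_tot
print(r2)
```

One possible reason the two readings disagreed, though I have not confirmed it, is that an OLS summary table typically reports R² on the data the model was fitted on, while r2_score() reports it on whichever rows and predictions are passed in.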

I’ll leave this as an artifact of how much my Python and analysis skills will grow over the months and years to follow as I continue on this journey into data science.

Additional items on my wishlist (aka things to improve upon):

  • Include new features: annual average electricity prices by state, cost of solar energy systems per kilowatt peak.
  • Stacked bar plots of customer segments through the years
  • State-by-state breakdown of the size of customer segments

Throughout the history of this blog I’ll touch upon the subject of energy quite often. So I’m glad that I was finally able to release my first entry on this subject!


  • Tracking the Sun (Berkeley Labs) dataset: [link]
  • Github link for my notebook: [link]
