Week 5: Feature Engineering

Ishan B -

Last week, I discussed the challenges I had with setting up the dataframe. The first challenge, changing the county names to numbers and keeping track of which number belongs to which name, was relatively easy. The other issue was a bit more of a challenge, but using pandas' built-in indexing, specifically iloc, I was able to loop through row 3 and use each element as the corresponding column header.
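
For readers who want to try the same cleanup, here is a minimal sketch of both steps. It is not my exact code: the table, the column names, and the assumption that the headers sit at zero-based position 3 are all made up for illustration, and it assigns the header row in one step instead of looping.

```python
import pandas as pd

# Hypothetical raw table: the real column labels sit in the row at
# zero-based position 3, with filler rows above them.
raw = pd.DataFrame([
    ["source: example", None, None],
    [None, None, None],
    [None, None, None],
    ["County", "Population", "Median income"],
    ["Adams", 1000, 50000],
    ["Baker", 2000, 60000],
    ["Adams", 1100, 51000],
])

# Keep only the data rows, then use the values of row 3 as the headers.
df = raw.iloc[4:].reset_index(drop=True)
df.columns = raw.iloc[3].tolist()

# Replace county names with integer codes and keep a code -> name lookup.
codes, names = pd.factorize(df["County"])
df["County"] = codes
county_lookup = dict(enumerate(names))
```

The lookup dictionary makes it possible to translate the numeric codes back into county names later, for example when reading the model's predictions.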

This week, I am focusing on feature engineering, the part of machine learning where you measure how strongly each input feature correlates with the output to decide which features matter for training your model. This helps remove unnecessary features that could confuse the model during training. Feature engineering is not strictly required to train a machine learning model, but I believe this extra step can improve the model’s accuracy.

I may also generate new features (columns) in my dataframe from existing columns. Sometimes the relationships between columns are more informative than the columns themselves, so by defining the relationship yourself, you can help the model train better and produce better results. I believe this will help with the columns that count the number of people with “less than a high school diploma” or a “high school diploma”: dividing each count by the county’s population gives the percentage of people at that education level. The same applies to the columns for bachelor’s degrees or higher and for some college or associate’s degrees.
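
The percentage idea can be sketched in a few lines of pandas. Everything here is an assumption made for illustration: the column names, the numbers, and the "target" column standing in for whatever the model predicts. The last step ranks features by the absolute value of their correlation with the target, since a strong negative correlation is as informative as a strong positive one.

```python
import pandas as pd

# Hypothetical county-level data; names and values are invented.
df = pd.DataFrame({
    "population":     [1000, 2000, 4000, 5000],
    "less_than_hs":   [300, 500, 800, 500],
    "hs_diploma":     [400, 800, 1600, 1500],
    "some_college":   [200, 400, 800, 1500],
    "bachelors_plus": [100, 300, 800, 1500],
    "target":         [1.0, 1.5, 2.0, 3.0],  # placeholder output column
})

# Turn each raw count into a percentage of the county's population.
count_cols = ["less_than_hs", "hs_diploma", "some_college", "bachelors_plus"]
for col in count_cols:
    df[f"pct_{col}"] = 100 * df[col] / df["population"]

# Correlation of every feature with the target (Pearson by default).
corr_with_target = df.corr()["target"].drop("target")

# Rank by strength of the relationship, ignoring its sign.
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

Comparing the ranking of the raw count columns with the new pct_ columns is one rough way to check whether the engineered features carry more signal than the originals.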

 


Comments:


    rohan_va
    Hi Ishan, feature engineering sounds like a fascinating concept! How do you gauge the correlation between input and output elements - is there a specific statistical test you use?
    adam_b
    Hi Ishan, this is a fascinating topic! Do you think a model like this could have applications in other fields?
    ishan_b
    Thanks for the question Rohan! pandas has a built-in method, .corr(), which outputs a number between -1 and 1 for each pair of columns. The closer the number is to 1 or -1, the stronger the correlation, which suggests the feature is more important. Values near 0 mean there is little linear relationship.
    ishan_b
    Thanks for the question Adam! Models like this are useful in many different applications that use tabular data. A big field where these types of models are starting to be applied is finance.
