Week 4: Dataset Formatting

Ishan B

This week, I focused on formatting the US Census datasets into the form that regression models need for training. This is challenging because regression models require numerical inputs, while the county names are text. One strategy I may use, so that my program can still output county names rather than numbers, is to keep two copies of the dataset: the original, and a duplicate in which every column is numeric. A dictionary would record which number corresponds to which county. The regression model can then train on the numeric copy, and once it produces an output, the code can look up the county name for that number.
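As a rough illustration of the dictionary idea, here is a minimal sketch in Python using pandas. The column names and values are made up for the example and are not from the actual Census files. It uses `pd.factorize`, which is one possible way to assign each distinct name an integer code, not necessarily the approach the final project will use.

```python
import pandas as pd

# Hypothetical sample data; the real Census columns and values will differ.
original = pd.DataFrame({
    "county": ["County A", "County B", "County A", "County C"],
    "population": [100, 200, 100, 300],
})

# factorize gives each distinct name an integer code,
# numbered in order of first appearance.
codes, names = pd.factorize(original["county"])

# The numeric duplicate: same data, but county names replaced by codes.
encoded = original.copy()
encoded["county"] = codes

# Dictionaries for translating in both directions.
county_to_code = {name: code for code, name in enumerate(names)}
code_to_county = {code: name for name, code in county_to_code.items()}

# Translate a numeric code back into a county name.
print(code_to_county[2])
```

The original DataFrame stays untouched, so the names are always available, while `encoded` holds only numbers and can be passed to a model.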

Challenges with the dataset

The datasets I obtained from the US Census website include formatting that my code cannot interpret on its own, such as description paragraphs at the top of the file and a table that only starts at row 3. Currently, when I import a dataset and convert it into a pandas DataFrame, the DataFrame has empty column headers and its first 2 rows are empty. Column headers are crucial for training machine learning models because each column is a factor the model is trained on, so to tune the model, the headers need to have the correct names. One idea for fixing this is to create a copy of the dataset that uses the names from row 2 as its column headers and drops any NaN (empty) values.
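One possible way to handle this, sketched below, is to tell pandas to skip the description lines when reading the file so that the header row is picked up directly. The file contents here are invented for the example, and the number of lines to skip is an assumption that would need to be checked against each real Census file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a Census CSV: two description lines,
# then the header row, then the data, then an empty trailing row.
raw_text = """Hypothetical dataset title line
Hypothetical description line
County,Population,Median Income
County A,100,50000
County B,200,
,,
"""

# skiprows=2 ignores the two description lines, so the next line
# is used as the column headers.
df = pd.read_csv(io.StringIO(raw_text), skiprows=2)

# Drop only rows in which every value is empty (NaN), so that rows
# with a single missing value are kept.
df = df.dropna(how="all").reset_index(drop=True)

print(df.columns.tolist())
print(df)
```

Using `dropna(how="all")` instead of a plain `dropna()` is a deliberate choice in this sketch: a plain `dropna()` would also delete the "County B" row because one of its values is missing. For Excel files, `pd.read_excel` accepts the same `skiprows` argument.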

Comments:

    camille_bennett
    Hi Ishan, sounds like you are doing some great problem-solving! Has there been a specific resource or person helping you with navigating regression models? Was this something you had worked with before senior project?
    ishan_b
    Thanks for the question Ms. Bennett! The main person who has been helping me navigate the regression models is my mentor, Mr. Krystian Confeiteiro. He has provided me with books that describe the different kinds of regression models and how to use them. During our meetings, he has also helped me gain a better understanding of which regression models would be best for my use case. Before this senior project, I had only minimal experience with regression models, and I have already learned a lot through this project.
