Week 4: Dataset Formatting
Ishan B
This week, I focused on formatting the US Census datasets into the form that regression models need for training. This is challenging because regression models require numerical input, while the county names are text. One strategy I may use to keep county names in my program's output is to hold two copies of the dataset: the original, and a duplicate in which every column is numeric. A dictionary would record which number corresponds to which county. This lets the regression model train on numeric data, and once the model produces an output, the code can look that number up in the dictionary and print the county name.
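A minimal sketch of that idea in pandas is below. The column names and values are made up for illustration and are not the real Census fields; the helper decode is also hypothetical. It uses pd.factorize to assign each distinct county name an integer and keeps a dictionary to translate numbers back into names.

```python
import pandas as pd

# Hypothetical sample data; the real Census columns and values will differ.
original = pd.DataFrame({
    "county": ["County A", "County B", "County A", "County C"],
    "population": [1000, 2000, 1000, 3000],
})

# pd.factorize gives each distinct name an integer, in order of first appearance.
codes, names = pd.factorize(original["county"])

# Numeric duplicate of the dataset; the original is left untouched.
encoded = original.copy()
encoded["county"] = codes

# Dictionary that stores which number is which county.
code_to_county = dict(enumerate(names))


def decode(code):
    """Translate a numeric county code back into its name."""
    return code_to_county[int(code)]


print(encoded)
print(decode(1))
```

One thing to keep in mind with this approach is that the integers are only labels: a model may still treat county 2 as "larger" than county 1, so the numbering should not be read as a ranking.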
Challenges with the dataset
The datasets I obtained from the US Census website include formatting that my code cannot parse as-is, such as description paragraphs at the top of the file, with the actual table starting at row 3. Currently, when I import a dataset and convert it into a pandas DataFrame, the DataFrame has empty column headers and its first 2 rows are empty. Column headers matter for training machine learning models because each column is a feature the model is trained on, so the headers need the correct names before I can select and tune those features. One idea for fixing this is to create a copy of the dataset that uses the names from row 2 as the column headers and drops any NaN (empty) values.
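One possible way to do this at import time is sketched below. The file layout is an assumption (two description lines, then a header row, then the data), and the text is a small stand-in for a real Census download, so the skiprows value would need to be adjusted to match the actual file. It uses the skiprows argument of pd.read_csv to skip the description lines and dropna(how="all") to remove rows that are entirely empty.

```python
import io

import pandas as pd

# Hypothetical stand-in for a downloaded file: two description lines,
# then the header row, then the data (including one fully empty row).
raw_text = """Example description line about the table
Example note about the source
County,Population,Median Income
County A,1000,50000
,,
County B,2000,60000
"""

# Skip the description lines so the real header row becomes the column names.
# With a real file, pass its path instead of the StringIO object.
df = pd.read_csv(io.StringIO(raw_text), skiprows=2)

# Drop rows where every value is NaN, then renumber the remaining rows.
df = df.dropna(how="all").reset_index(drop=True)

print(df.columns.tolist())
print(df)
```

Using how="all" removes only rows that are completely empty; a plain dropna() would also discard rows that are missing just one value, which may throw away usable data.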