Week 2: An Overview of Data Collection and Types of Regression
Welcome back to my blog! This is Akshita again. This week, I have been exploring how to collect and analyze data for my project. In this post, I want to introduce the database I will be using for my research and the types of regression models that could help me analyze the data.
Data Collection
For my research, I will be using the IPUMS database, the world’s largest accessible database of census microdata. IPUMS USA contains nearly a billion records from U.S. censuses, including decennial censuses from 1790 to 2010 and American Community Surveys (ACS) from 2000 to the present.
From IPUMS, I will create a data extract of about twenty variables, including:
- Demographics: Sex, age, race/ethnicity, education, years in the U.S. (for immigrants)
- Employment & Income: Occupation, income, industry
- Geographic Data: State of residence
- Survey & Census Metadata: Census year, IPUMS sample identifier, household and person weights, group quarters status
By analyzing these variables, I can assess which groups may be most vulnerable to AI-driven job displacement and identify patterns across different industries and demographic groups.
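Because each record in a census sample stands in for many people, group summaries are usually computed with the person weight rather than by counting rows. Here is a minimal sketch of that idea in plain Python. The records are made-up toy values, and the "exposed" flag is a hypothetical placeholder I invented for illustration; only PERWT is meant to correspond to the IPUMS person weight.

```python
# Toy records with invented values. PERWT stands for the IPUMS person weight;
# "exposed" is a hypothetical 0/1 flag, not a real IPUMS variable.
records = [
    {"industry": "Retail", "PERWT": 120, "exposed": 1},
    {"industry": "Retail", "PERWT": 80, "exposed": 0},
    {"industry": "Health", "PERWT": 100, "exposed": 0},
    {"industry": "Health", "PERWT": 50, "exposed": 1},
    {"industry": "Health", "PERWT": 50, "exposed": 0},
]


def weighted_share(rows, group_key, flag_key, weight_key="PERWT"):
    """Return the weighted share of flagged people within each group."""
    total_weight = {}
    flagged_weight = {}
    for row in rows:
        group = row[group_key]
        weight = row[weight_key]
        total_weight[group] = total_weight.get(group, 0) + weight
        # Only rows with flag == 1 contribute their weight to the numerator.
        flagged_weight[group] = flagged_weight.get(group, 0) + weight * row[flag_key]
    return {g: flagged_weight[g] / total_weight[g] for g in total_weight}


shares = weighted_share(records, "industry", "exposed")
for industry, share in shares.items():
    print(f"{industry}: {share:.2f}")
```

An unweighted count would treat every row equally, so the weighted version is the safer default whenever the goal is to describe the population rather than the sample.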
What is Regression?
Simply put, regression is a statistical technique used to model the relationship between a dependent variable (the outcome we want to predict) and one or more independent variables (the factors that may influence the outcome). A regression model estimates how the dependent variable changes, on average, as the independent variables change. It does this by finding the best-fit line or curve through the data and measuring how far the observations fall from it.
For my research, the dependent variable will likely be the exposure to job displacement due to AI, while the independent variables will include the demographic and occupational characteristics.
Now, let’s look at some types of regression that could be useful for my research:
Types of Regression Models
Linear Regression
Linear regression is the simplest form of regression. It models the relationship between the dependent and independent variables as a straight line, so it works best when the data are roughly linearly related.
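To show what finding a best-fit line means in practice, here is a minimal sketch of ordinary least squares with a single independent variable, written in plain Python. The variable names and numbers are invented for illustration and are not results from my data.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data: years of education vs. an invented exposure score.
years_of_education = [10, 12, 14, 16, 18]
exposure_score = [0.70, 0.62, 0.55, 0.41, 0.37]

intercept, slope = fit_line(years_of_education, exposure_score)
print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
```

The slope is read as the average change in the outcome for a one-unit increase in the independent variable. With more than one independent variable the same idea applies, but the arithmetic is normally left to statistical software.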
Logistic Regression
Logistic regression is used when the dependent variable is binary (e.g., Yes/No or 1/0). Instead of predicting a continuous outcome, logistic regression uses a sigmoid curve to estimate the probability of an event occurring.
Polynomial Regression
Polynomial regression is used when the relationship between variables is non-linear. It captures curved relationships by including higher-degree terms of the independent variables, such as squared or cubed values.
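Adding higher-degree terms turns out to be the same least-squares fit with extra columns (x, x squared, and so on). Here is a minimal sketch in plain Python that builds those columns and solves the resulting normal equations. The helper names and the toy data are my own for illustration, and the code assumes at least degree + 1 distinct x values.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left unchanged.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree=2):
    """Return least-squares coefficients [b0, b1, ..., b_degree]."""
    # Each row of the design matrix is [1, x, x^2, ...].
    rows = [[x ** p for p in range(degree + 1)] for x in xs]
    k = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy data generated from y = 1 + 2x + 3x^2, so the fit should recover it.
xs = [-2, -1, 0, 1, 2]
ys = [9, 2, 1, 6, 17]
coefficients = fit_polynomial(xs, ys, degree=2)
print([round(c, 6) for c in coefficients])
```

Even though the fitted curve bends, the model is still linear in its coefficients, which is why the same least-squares machinery works.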
What’s next?
In the upcoming weeks, I will explore how I can apply these regression models to my own research and start analyzing my data in R. Stay tuned for more updates!