Week 2: An Overview of Data Collection and Types of Regression

Akshita K -

Welcome back to my blog! This is Akshita again, and this week I have been exploring how to collect and analyze data for my project. In this post, I want to introduce you to the database I will be using for my research and the types of regression models that could be helpful in analyzing the data.

Data Collection

For my research, I will be using the IPUMS database, the world’s largest accessible database of census microdata. IPUMS USA contains nearly a billion records from U.S. censuses, including decennial censuses from 1790 to 2010 and American Community Surveys (ACS) from 2000 to the present.

From IPUMS, I will create a data extract of about twenty variables, including:

  • Demographics: Sex, age, race/ethnicity, education, years in the U.S. (for immigrants)
  • Employment & Income: Occupation, income, industry
  • Geographic Data: State of residence
  • Survey & Census Metadata: Census year, IPUMS sample identifier, household and person weights, group quarters status

By analyzing these variables, I can assess which groups may be most vulnerable to AI-driven job displacement and identify patterns across different industries and demographic groups.

What is Regression?

Simply put, regression is a statistical technique used to model relationships between a dependent variable (the outcome we want to predict) and one or more independent variables (the factors that may influence the outcome). A regression model estimates how changes in the independent variables are associated with changes in the dependent variable by finding a best-fit line (or curve) and examining how the data points are scattered around it.

For my research, the dependent variable will likely be the exposure to job displacement due to AI, while the independent variables will include the demographic and occupational characteristics.

Now, let’s look at some types of regression that could be useful for my research:

Types of Regression Models

Linear Regression

Linear regression is the simplest form of regression: it models the relationship between the dependent and independent variables as a straight line, which is appropriate when the data are roughly linearly related.
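To make the idea of a best-fit line concrete, here is a minimal sketch in Python with NumPy (I plan to do the real analysis in R, so this is only an illustration). The numbers are made-up toy data, not IPUMS records, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical toy data (NOT real IPUMS data):
# x = years of education, y = an invented AI exposure score between 0 and 1.
x = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y = np.array([0.22, 0.29, 0.41, 0.48, 0.60])

# Design matrix: a column of ones (for the intercept) next to x.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: find the intercept and slope that minimize
# the sum of squared differences between y and the fitted line.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

# Fitted values and residuals (how far each point sits from the line).
predicted = X @ coef
residuals = y - predicted

print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
```

The slope is the estimated change in the outcome for a one-unit increase in x, and the residuals describe how the points are scattered around the line.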

Logistic Regression

Logistic regression is used when the dependent variable is binary (e.g., Yes/No or 1/0). Instead of predicting a continuous outcome, logistic regression uses a sigmoid (S-shaped) curve to estimate the probability of an event occurring, so its predictions always fall between 0 and 1.
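As a rough illustration of how the sigmoid curve turns a linear score into a probability, here is a small Python sketch. The coefficients are hypothetical values chosen by hand rather than estimated from any data, and the binary outcome ("high exposure" vs. not) is only an assumed example.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, picked by hand for illustration only.
intercept = -4.0
slope = 0.25


def predicted_probability(x: float) -> float:
    """Probability that the binary outcome equals 1 for a given x."""
    return sigmoid(intercept + slope * x)


p_low = predicted_probability(8.0)    # linear score = -2
p_mid = predicted_probability(16.0)   # linear score = 0, so probability is 0.5
p_high = predicted_probability(24.0)  # linear score = +2

print(p_low, p_mid, p_high)
```

A linear score of zero corresponds to a probability of exactly 0.5, and larger scores push the probability toward 1 without ever reaching it.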

Polynomial Regression

Polynomial regression is used when the relationship between variables is non-linear. It captures curved relationships by including higher-degree terms of the independent variables, such as a squared term.
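Here is a minimal sketch of the difference between a straight-line fit and a polynomial fit, again in Python with NumPy for illustration only. The data are invented so that they follow an exact curve; real data would be noisier.

```python
import numpy as np

# Hypothetical toy data that rises and then falls (an inverted U shape).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 3.0, 4.0, 3.0, 0.0])

# Degree-1 fit: a straight line. Coefficients come back highest power first.
line_coeffs = np.polyfit(x, y, deg=1)

# Degree-2 fit: adds an x**2 term so the fitted curve can bend.
quad_coeffs = np.polyfit(x, y, deg=2)

# Sum of squared errors for each fit (smaller means a closer fit).
line_sse = float(np.sum((y - np.polyval(line_coeffs, x)) ** 2))
quad_sse = float(np.sum((y - np.polyval(quad_coeffs, x)) ** 2))

print("line coefficients:", line_coeffs, "SSE:", line_sse)
print("quadratic coefficients:", quad_coeffs, "SSE:", quad_sse)
```

On this toy data the straight line is flat and misses the pattern entirely, while the quadratic follows the curve. With real data, adding higher-degree terms should be done cautiously, since a very flexible curve can end up fitting noise.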

What’s next?

In the upcoming weeks, I will explore how I can apply these regression models to my own research and start analyzing my data in R. Stay tuned for more updates!

Comments:

    shriya_s
    Hi Akshita, I think your research is so relevant in today's age, and you are doing a great job studying it! How will you be measuring the exposure to AI-driven job displacement? Or, do you have operational definitions for AI-driven job displacement that can help you measure it objectively?
    aashi_h
    Hey Akshita! I love the idea for your project and how it is going so far. What is your inspiration for choosing this topic?
    akshita_k
    Thank you for your question, Shriya! Many prior studies have developed AI Occupational Exposure Indexes, which quantify various occupations' exposure to AI on a scale from 0 to 1. I plan to use an index from a 2024 study by the U.S. Treasury Department. Using the R programming language and data from the IPUMS USA database, I will assign each occupation an AI exposure value and run a regression analysis, with demographics (such as sex, race, age group, and education level) as independent variables and AI exposure as the dependent variable. This approach will help identify which groups are most affected by AI-driven job displacement. Additionally, I plan to compare AI’s impact across different industries (such as IT, healthcare, etc.) and examine whether white-collar or blue-collar jobs are more impacted.
    akshita_k
    Hey Aashi, thank you for your interest! I think my inspiration for this project originates from both my prior experience studying AI and the ongoing discourse in the news about companies replacing workers with AI. Studying machine learning through an online course introduced me to its capabilities and sparked my curiosity about its societal implications. At the same time, seeing discussions about how certain demographics—especially marginalized groups—are more impacted by AI made me want to dive deeper into this issue.
