Control
Rehan N -
Hey guys! This week I have been focusing on building the control dataset for my research. As a refresher, my project tests whether AI detectors like GPTZero perform equitably for ESL (English as a Second Language) speakers.
The control dataset is a pool of AI-generated texts that will be fed into the AI detectors as a test. It's integral to establishing a baseline so that the other datasets have a benchmark accuracy to compare against.
To begin building the dataset, I used ChatGPT's o1 model (since it had the most advanced reasoning capabilities at the time of writing), starting with the prompt: “Create 50 realistic mock Reddit posts with a character count of at least 250 across different subreddits,” along with a file containing some examples.
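For anyone curious, here's roughly what that step looks like scripted against the OpenAI Python SDK instead of the ChatGPT interface. This is a minimal sketch, not my exact workflow: the file of examples isn't attached here, and it assumes OPENAI_API_KEY is set in the environment.

```python
# Minimal sketch of the generation step via the OpenAI Python SDK.
# Assumes OPENAI_API_KEY is set; the example file I gave ChatGPT is omitted.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Create 50 realistic mock Reddit posts with a character count of "
    "at least 250 across different subreddits."
)

response = client.chat.completions.create(
    model="o1",  # the reasoning model I used for the control set
    messages=[{"role": "user", "content": PROMPT}],
)

# Save the raw generations so each post can be split out and labeled later.
with open("control_posts_raw.txt", "w", encoding="utf-8") as f:
    f.write(response.choices[0].message.content)
```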
Each generated post mimicked the style of online forum posts and matched the tone and structure that I'll be analyzing in the other datasets. The mix of topics and the varying lengths also kept the generations as close to actual forum posts as possible.
One of the challenges this week was figuring out how 'human-like' these AI-generated texts should read, because these models can deliberately produce very polished or very poor writing. I needed to find a middle ground: variations that aren't overly well-written and that contain some slang. Another challenge was data consistency, because the control group needed to embody features that AI detectors might struggle to identify, and it took some trial and error to find that balance.
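To give a sense of that trial and error, here's a rough sketch of the kind of prompt variation I experimented with. The style instructions below are illustrative placeholders, not my exact wording:

```python
# Sketch of varying the prompt so the output isn't uniformly polished.
# The style modifiers are hypothetical examples, not my actual prompts.
import random

BASE_PROMPT = (
    "Create 10 realistic mock Reddit posts with a character count of "
    "at least 250 across different subreddits."
)

STYLE_MODIFIERS = [
    "Write casually, with occasional slang and minor typos.",
    "Write in a mid-quality conversational tone, not overly polished.",
    "Write like a rushed commenter: short sentences, informal grammar.",
]

def build_prompt() -> str:
    """Append a randomly chosen style instruction to the base prompt."""
    return f"{BASE_PROMPT} {random.choice(STYLE_MODIFIERS)}"

print(build_prompt())
```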
Already, while building the dataset, I've begun to notice some interesting trends: the more conversational-sounding AI-generated texts are flagged at a lower percentage than the other forum posts. These early observations help validate the premise of my project.
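Here's roughly how a post can be scored programmatically. This sketch is based on GPTZero's documented v2 API at the time of writing; the endpoint, header, and response fields are assumptions that may have changed, and it assumes a GPTZERO_API_KEY environment variable:

```python
# Sketch of scoring control posts with GPTZero's public API.
# Endpoint and response fields follow GPTZero's documented v2 API and
# may differ in current versions; GPTZERO_API_KEY is assumed to be set.
import os
import requests

API_URL = "https://api.gptzero.me/v2/predict/text"

def detector_score(text: str) -> float:
    """Return GPTZero's probability that `text` is entirely AI-generated."""
    resp = requests.post(
        API_URL,
        headers={"x-api-key": os.environ["GPTZERO_API_KEY"]},
        json={"document": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["documents"][0]["completely_generated_prob"]

# Toy comparison; real control posts are at least 250 characters.
posts = {
    "conversational": "ngl this new update kinda slaps, been using it all week...",
    "formal": "The recent update introduces several improvements that enhance...",
}
for label, text in posts.items():
    print(label, detector_score(text))
```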
After I finalize this dataset, I'll begin curating the remaining datasets of native English and ESL posts so I can start running more tests. I'm looking forward to seeing how the detectors do in the next steps!
