Week 3 - A Supervised Approach to Training an Artificial Intelligence to Extract Relevant Genomic Data from Literature

Adam B - February 23, 2025 7:28 am

Hello everyone! This past week brought me deeper into both model comparisons and streamlining workflows for data extraction.

This week, I used two different AI models: ChatGPT o3 and a second, newer, less well-known AI, Grok 3. Grok is an interesting AI made by xAI because it was trained on typical public documents as well as synthetic datasets. When I first incorporated it into this project, I was skeptical, but it appears to be remarkably effective (even more so than GPT at times).

I judged each model mainly on how accurately and clearly it represented the data it extracted from a paper. Interestingly, while both models handled the task relatively well, the “o3” model offered more consistent formatting and more precise designations of gene relationships, while Grok 3 was capable of understanding complex nuances in the paper and had longer, more thorough (albeit more complicated) outputs.

I mentioned last week that small changes to the prompt can have significant impacts on the output. This week, one of my major edits was just that: a simple two-line edit to my instruction set emphasizing that each gene must be classified as either upregulated or downregulated, with no other options. Much of my editing also came in the form of removing redundancy in my instructions so that more of the AI’s processing goes toward the actual paper analysis.
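For anyone who wants to check that rule on the output side, here is a minimal sketch of how it could be done. It is an illustration, not part of my actual workflow: it assumes the model returns one `GENE: direction` pair per line, which is a hypothetical format, and the names `ALLOWED_DIRECTIONS`, `parse_line`, and `split_by_validity` are made up for this example.

```python
# Hypothetical check: keep only lines whose direction is one of the two
# allowed labels. The "GENE: direction" line format is an assumption.
ALLOWED_DIRECTIONS = {"upregulated", "downregulated"}


def parse_line(line):
    """Split a 'GENE: direction' line into a (gene, direction) pair."""
    gene, sep, direction = line.partition(":")
    if not sep:
        raise ValueError(f"no ':' separator in line: {line!r}")
    return gene.strip(), direction.strip().lower()


def split_by_validity(lines):
    """Return (valid pairs, rejected raw lines) for the model's output."""
    valid, rejected = [], []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        try:
            gene, direction = parse_line(line)
        except ValueError:
            rejected.append(line)
            continue
        if gene and direction in ALLOWED_DIRECTIONS:
            valid.append((gene, direction))
        else:
            rejected.append(line)
    return valid, rejected


if __name__ == "__main__":
    sample = ["TP53: upregulated", "MYC: Downregulated", "BRCA1: unchanged"]
    valid, rejected = split_by_validity(sample)
    print("valid:", valid)
    print("rejected:", rejected)
```

Anything that lands in the rejected list would be a sign that the prompt still lets a third option slip through.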

Another update: I (well, the developers of ChatGPT and Grok) have developed a solution to the PDF unreadability problem I mentioned in my last post. Both AIs can now read Word documents directly, bypassing the PDFs entirely. I still have no idea why my solution in the last post worked.

Looking ahead, I intend to continue refining the instructions and rely more on Grok for upcoming papers. My hope is that these incremental changes, along with better formatting, will improve extraction reliability without sacrificing clarity or consistency. Thanks for following along.

Adam

Comments:

rohan_va

Hi Adam, the level of detail these AI models provide is fascinating! In the future, when this trained AI becomes widely accessible, do you anticipate users having to make small changes, or 're-train' the AI before their specific use? Or, would the product be able to adapt automatically to the different goals that users have?

February 23, 2025 at 1:37 pm

adam_b

Hi Rohan, that's a great question! Since we are training our AI model on a wide variety of papers, it should be able to adapt automatically according to the specifics of the user's instructions.

February 25, 2025 at 11:59 am

marcos_v

Hi Adam, this project is very interesting! I'm curious, for the finalized product, would you fully stick to one model, GPT or Grok, or do you intend to incorporate both of them? At what point would you implement this change?

February 28, 2025 at 2:25 pm

adam_b

Hi Marcos, that's a fantastic question! The finalized product is the prompt, sort of like a "brain" you could put into any model. It will be optimized for Grok because it is the most effective at the moment, so I suppose you could say that change has already happened.

February 28, 2025 at 6:50 pm
