Week 6-A Supervised Approach to Training an Artificial Intelligence to Extract Relevant Genomic Data from Literature

Adam B -

Hello
This week’s work continued to focus on implementing my paper processor as an API. Because the AI was originally web-based, I am finding many discrepancies between the old version and the new API. Here, I will discuss those discrepancies and how they were addressed throughout the week.

Information Restrictiveness

One of the most important concerns with the AI is accuracy. While the web-based platform was relatively contained in its data collection (at times, in the early stages, even under-reporting what a paper had found), the API version captures far more extraneous information that is either irrelevant to, or contradicts, what the article suggests.

Throughout the week, I continued to modify the code to remedy these errors. One major fix involved refining how the prompt distinguishes between core text and supplementary areas (like citation references, tables, and figures). By clarifying that only the main body of the text should be considered authoritative (and, for example, NOT the titles of cited sources), the AI became more accurate and consistent. Even so, there are still instances when it captures information from less relevant sections. Perfecting this process will be the subject of later weeks.
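The prompt refinement described above can be sketched roughly as follows. This is only an illustrative Python sketch, assuming a hypothetical `build_extraction_prompt` helper and made-up section names; it is not the project's actual prompt or code.

```python
# Illustrative sketch: restrict extraction to the main body of a paper.
# The section names and instruction wording below are assumptions, not
# the real prompt used in the project.

EXCLUDED_SECTIONS = ["references", "citation titles", "figure captions"]


def build_extraction_prompt(paper_text: str) -> str:
    """Assemble a prompt that marks only the main body as authoritative."""
    exclusions = ", ".join(EXCLUDED_SECTIONS)
    instructions = (
        "Extract genomic findings ONLY from the main body of the paper. "
        f"Do NOT treat the following as authoritative: {exclusions}. "
        "In particular, never report findings taken from the titles of "
        "cited sources."
    )
    return f"{instructions}\n\n---\n{paper_text}"
```

The key design point is that the restriction lives in the instructions sent with every paper, so the model sees it each time rather than relying on any one-off correction.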

Validation

While not exactly a web vs. API discrepancy, an important test for an API is the validation process, which I began this week. This uses a cohort of 20 papers previously analyzed by me and other lab members. The key difference between this phase and earlier testing is how those 20 papers are delivered: previously, they were processed one at a time, but now the model will receive them en masse.
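The two delivery modes can be sketched like this. This is a hedged Python sketch under my own assumptions: `analyze` stands in for the actual API call (which is not shown in the post), and the paper separator is invented for illustration.

```python
# Illustrative sketch of the two delivery modes described above:
# one request per paper vs. a single request carrying the whole cohort.
# `analyze` is a placeholder for the real API call.

from typing import Callable


def process_individually(papers: list[str],
                         analyze: Callable[[str], dict]) -> list[dict]:
    """One request per paper, as in the earlier testing phase."""
    return [analyze(paper) for paper in papers]


def process_en_masse(papers: list[str],
                     analyze: Callable[[str], dict]) -> dict:
    """A single request delivering all papers at once."""
    combined = "\n\n=== NEXT PAPER ===\n\n".join(papers)
    return analyze(combined)
```

Batching trades per-paper attention for throughput, which is why the author anticipates (in the comments below) that more concise inputs may be needed to keep the en-masse results sharp.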

This week was simply about having the model process these papers. Next week, I’ll dedicate time to assessing the outputs and coding additional constraints or clarifications that might tighten focus on essential data.

Adam


Comments:


    camille_bennett
    Hi Adam, so interesting to continue to see your insight in training this model. Do you anticipate any issues with the model analyzing the papers en masse vs individually?
    adam_b
    Hi Ms. Bennett, that's a good question! I do anticipate the model being less effective with multiple papers than with individual ones because of the additional processing that will be needed. Hopefully, however, this can be remedied with more concise inputs.
    Rahul Patel
    Hey Adam, great progress on refining the AI! I'm curious—how will you prevent the AI from pulling irrelevant or contradictory info from supplementary sections like citations or tables as you continue fine-tuning the prompt? Looking forward to seeing the next steps!
    aashi_h
    Hi Adam, I love your project so far! What constraints or clarifications do you anticipate you will have to code to strengthen the model's validation?
    adam_b
    Hi Rahul, great question! Preventing the AI from pulling irrelevant information can be as simple as adding that to the instructions; simply saying “do not pull from citations” or “keep tables but ignore unprocessable graphs” appears to be instruction enough.
    adam_b
    Hi Aashi, great question! It’s difficult to say what changes I’ll need to make to the model before the validation process is fully done, but I anticipate many small changes rather than any full restructuring of the model.
