Week 2 Introduction—A Supervised Approach to Training an Artificial Intelligence to Extract Relevant Genomic Data from Literature
Adam B
Hi everyone, it’s Adam! In the last post, I mentioned I would describe some of the challenges that appear when training the AI model:
1) Inconsistent Paper Structure
The first major problem I’ve found is the diverse formatting of research papers. Some papers include lengthy tables that run across pages, others use unconventional table structures, and still others have complex figures that are difficult for the model to process. The AI has occasionally missed important data when faced with these complex layouts, either by skipping genes entirely or by mixing up the rows of the table so that the extracted data is mismatched. To fix this, I updated the instructions so that the AI scans documents in smaller, more systematic steps. This approach is still being tested, but it appears to be working so far.
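The fix described above was a change to the instructions, not to any code. Still, the same idea of scanning in smaller, systematic steps can be sketched in Python. This is only an illustration under assumptions: the function names, the batch size, and the prompt wording are all hypothetical and are not the instructions used in this project.

```python
from typing import Iterator, List, Sequence


def page_batches(pages: Sequence[str], batch_size: int = 2) -> Iterator[List[str]]:
    """Yield consecutive groups of pages so each request covers a small span."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pages), batch_size):
        yield list(pages[start:start + batch_size])


def build_prompts(pages: Sequence[str], batch_size: int = 2) -> List[str]:
    """Build one instruction per batch (the wording here is hypothetical)."""
    prompts = []
    for index, batch in enumerate(page_batches(pages, batch_size)):
        first = index * batch_size + 1      # 1-based number of the first page
        last = first + len(batch) - 1       # 1-based number of the last page
        header = (
            f"Pages {first}-{last}: list every gene and its data, "
            "row by row, in the order it appears."
        )
        prompts.append(header + "\n\n" + "\n".join(batch))
    return prompts
```

Keeping each request to a page or two means a long table is read in short stretches, which is the spirit of the updated instructions.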
2) PDF Unreadability
Another obstacle is PDF readability. Occasionally, PDFs were flagged as partially unreadable by the AI’s optical character recognition (OCR) tool. After some trial and error, I found a peculiar solution: converting the original file to an image, then back into a PDF. I am unsure why this works, but I will continue experimenting with file types to figure it out.
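For anyone who wants to try the image round trip in code instead of by hand, here is a minimal sketch. It assumes the PyMuPDF library (imported as `fitz`), which is not necessarily the tool used in this project; the function names, the `_rasterized` suffix, and the 200 dpi default are hypothetical choices.

```python
from pathlib import Path


def rasterized_path(src: str) -> Path:
    """Return a sibling path such as paper_rasterized.pdf for paper.pdf."""
    source = Path(src)
    return source.with_name(source.stem + "_rasterized.pdf")


def rasterize_pdf(src: str, dpi: int = 200) -> Path:
    """Render each page to an image and rebuild a PDF from those images."""
    import fitz  # PyMuPDF (assumed); imported here so the path helper works without it

    dst = rasterized_path(src)
    with fitz.open(src) as original, fitz.open() as rebuilt:
        for page in original:
            pixmap = page.get_pixmap(dpi=dpi)  # PDF page -> raster image
            new_page = rebuilt.new_page(
                width=page.rect.width, height=page.rect.height
            )
            new_page.insert_image(new_page.rect, pixmap=pixmap)  # image -> PDF page
        rebuilt.save(str(dst))
    return dst
```

Calling `rasterize_pdf("paper.pdf")` would write `paper_rasterized.pdf` next to the original. The rebuilt file contains only page images, with no embedded text layer, so whatever reads it has to rely on OCR alone. Whether that is related to why the trick helps is still an open question.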
With these issues addressed, the outputs appear to be more consistent. Next time, I’ll discuss the ongoing updates to the prompt and how even small or self-evident tweaks can drastically improve the AI’s accuracy.