Week 2 Introduction—A Supervised Approach to Training an Artificial Intelligence to Extract Relevant Genomic Data from Literature

Adam B -

Hi everyone, it’s Adam! In the last post, I mentioned I would describe some of the challenges that appear when training the AI model:

1)    Inconsistent Paper Structure

The first major problem I’ve found is the diverse formatting of research papers. Some papers include lengthy tables that run across pages, others use unconventional table structures, and still others have complex figures that are difficult for the model to process. The AI has occasionally missed important data when faced with these layouts, either by ignoring genes entirely or by mixing up the rows of a table so that the extracted data is mismatched. To fix this, I updated the instructions so the model scans each document in smaller, more systematic steps. This approach is still being tested, but it appears to be working so far.
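
The fix described above lives in the prompt, but the same "smaller steps" idea could also be applied in code before any text reaches the model. The following is only a minimal sketch under that assumption, not the method used in this project: `page_windows` and `merge_rows` are hypothetical helper names, and the window size and overlap are arbitrary defaults. Overlapping windows are used so that a table running across a page break appears whole in at least one window.

```python
from typing import Hashable, Iterable, Iterator, List, Sequence


def page_windows(n_pages: int, size: int = 2, overlap: int = 1) -> Iterator[range]:
    """Yield overlapping runs of 0-based page indices that cover the document."""
    if size < 1 or not 0 <= overlap < size:
        raise ValueError("need size >= 1 and 0 <= overlap < size")
    step = size - overlap
    start = 0
    while start < n_pages:
        end = min(start + size, n_pages)
        yield range(start, end)
        if end == n_pages:
            break  # the last page has been covered
        start += step


def merge_rows(batches: Iterable[Iterable[Sequence[Hashable]]]) -> List[Sequence[Hashable]]:
    """Combine per-window results, dropping rows repeated by overlapping windows."""
    seen = set()
    merged = []
    for batch in batches:
        for row in batch:
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged
```

In use, each window of pages would be sent to the model as its own request, and the rows it returns (for example, one tuple per gene) would be passed to `merge_rows` in window order. Exact-match deduplication is a simplification: if the model formats the same row slightly differently in two windows, both copies survive and would need a looser comparison.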

2)    PDF Unreadability

Another obstacle is PDF readability. Occasionally, PDFs were reported to be partially unreadable by the AI’s optical character recognition (OCR) tool. After some trial and error, I found a peculiar solution: converting the original file to an image, then back into a PDF. I am unsure why this works, but I will keep experimenting with file types to figure it out.
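
The post does not say which tool performed the conversion, so the following is only one way the PDF-to-image-to-PDF round trip could be scripted. It assumes two command-line programs are installed and on the PATH: `pdftoppm` (from Poppler) to render pages as PNG files and `img2pdf` to wrap the images back into a PDF. The function names and the 300 dpi default are illustrative choices, not part of the original workflow.

```python
import re
import subprocess
import tempfile
from pathlib import Path
from typing import List


def rasterize_cmd(pdf: Path, prefix: Path, dpi: int = 300) -> List[str]:
    """Command that renders each page of `pdf` to <prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def rebuild_cmd(images: List[Path], out_pdf: Path) -> List[str]:
    """Command that wraps the page images back into a single PDF."""
    return ["img2pdf", *map(str, images), "-o", str(out_pdf)]


def page_number(image: Path) -> int:
    """Page number taken from a pdftoppm output name such as 'page-07.png'."""
    match = re.search(r"-(\d+)$", image.stem)
    if match is None:
        raise ValueError(f"unexpected image name: {image.name}")
    return int(match.group(1))


def roundtrip(pdf: Path, out_pdf: Path, dpi: int = 300) -> None:
    """PDF -> page images -> PDF. Requires both external tools (assumed installed)."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_cmd(pdf, prefix, dpi), check=True)
        # Sort numerically so page 10 does not land before page 2.
        images = sorted(Path(tmp).glob("page-*.png"), key=page_number)
        subprocess.run(rebuild_cmd(images, out_pdf), check=True)
```

Calling `roundtrip(Path("paper.pdf"), Path("paper_flat.pdf"))` would write an image-only copy of a (hypothetical) `paper.pdf`. The result has no selectable text, so everything downstream depends on OCR reading the page images.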

Addressing these issues appears to have improved the consistency of the outputs. Next time, I’ll discuss the ongoing updates to the prompt and how even small or seemingly self-evident tweaks can drastically improve the AI’s accuracy.


Comments:


    ryan_s
    Hello my fine sir. Do you think it would be possible to automate the fixes to your current errors so you wouldn't have to change every bit of information you train your AI with?
    adam_b
    That is a great question, Ryan! A lot of these errors involve fine-tuning or adding clauses to the instructions, so I usually don’t have to reinvent the instructions each time. However, I’m sure it might eventually be possible for a machine to refine the instructions with enough training.
