Week 8 - A Supervised Approach To Training An Artificial Intelligence To Extract Relevant Genomic Data From Literature
Adam B -
Hello
This week, I began validating the AI by manually processing 20 research papers. My manual results serve as a control against which to compare the AI’s output. During this comparison, I identified several notable concerns.
Upregulation or Downregulation:
One of the most important aspects of the AI’s output is whether a given gene is classified as upregulated or downregulated. The latest concern is that some genes are being misclassified, for two reasons. First, the criteria are unclear: for example, the instructions do not say how to distinguish a simple fold change (FC), the multiplicative change in the amount of genetic material found in a sample, from LogFC, the logarithm of that value. Second, the AI defaults to downregulation when faced with an ambiguous gene case. Both issues appear to be fixable by improving the instructions on how threshold-based designations should be applied and on how ambiguous genes should be reported.
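To make the threshold idea concrete, here is a minimal sketch of the kind of explicit rule the instructions could spell out. It is not the project's actual logic: the function name classify_regulation, the scale labels "fc" and "log2fc", the use of a base-2 logarithm, and the cutoff of 1.0 are all assumptions chosen for illustration. The key point is that an unclear case is reported as "ambiguous" rather than silently treated as downregulated.

```python
import math
from typing import Optional


def classify_regulation(
    value: Optional[float],
    scale: Optional[str],
    log2_threshold: float = 1.0,  # assumed cutoff, not taken from the project
) -> str:
    """Classify a gene from a fold-change measurement.

    scale is "fc" for a raw fold change or "log2fc" for a base-2 log
    fold change (both labels are hypothetical). Anything that cannot be
    interpreted safely is returned as "ambiguous".
    """
    if value is None or scale not in ("fc", "log2fc"):
        return "ambiguous"

    if scale == "fc":
        # A raw fold change is a ratio, so it must be positive.
        # Zero or negative values suggest a different convention.
        if value <= 0:
            return "ambiguous"
        log2fc = math.log2(value)
    else:
        log2fc = value

    if log2fc >= log2_threshold:
        return "upregulated"
    if log2fc <= -log2_threshold:
        return "downregulated"
    return "unchanged"
```

Under these assumptions, a raw fold change of 4 and a log2 fold change of 2 are treated the same way, which removes one source of the misclassification described above. A value whose scale is not stated never falls through to a default label.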
Table Fables:
Another critical issue arises from the way the AI extracts information from tables embedded in the articles. Certain data tables list experimental conditions and statistical measures in dense, varied formats, and the AI’s current instructions appear insufficient for recognizing the correct data boundaries: some extractions contain incorrect gene-condition pairings, while others are too jumbled to interpret. This should be fixable by giving the model a systematic procedure for processing tables, instead of having it devise a new one each time it is given a table.
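As an illustration of what a systematic procedure might look like, the sketch below walks a table one row at a time and emits one record per gene-condition pair. It assumes the table has already been extracted into a list of rows with a header row and a column named "Gene"; the function name table_to_records, the record fields, and the status labels are hypothetical. Rows or cells that do not fit the expected shape are flagged rather than guessed at.

```python
from typing import Dict, List


def table_to_records(
    rows: List[List[str]],
    gene_column: str = "Gene",  # assumed header name for the gene column
) -> List[Dict[str, object]]:
    """Turn a header row plus data rows into gene-condition records."""
    if not rows:
        return []

    header = [cell.strip() for cell in rows[0]]
    if gene_column not in header:
        raise ValueError(f"no column named {gene_column!r} in header")
    gene_idx = header.index(gene_column)

    records: List[Dict[str, object]] = []
    for row in rows[1:]:
        # A row with the wrong number of cells cannot be paired reliably.
        if len(row) != len(header):
            records.append(
                {"gene": None, "condition": None, "value": None,
                 "status": "misaligned"}
            )
            continue

        gene = row[gene_idx].strip()
        for idx, condition in enumerate(header):
            if idx == gene_idx:
                continue
            try:
                value = float(row[idx].strip())
                status = "ok"
            except ValueError:
                # Keep the pairing but mark the cell as unreadable.
                value = None
                status = "unparsed"
            records.append(
                {"gene": gene, "condition": condition, "value": value,
                 "status": status}
            )
    return records
```

Because every record carries its own gene and condition, a pairing error shows up as a flagged record that can be checked against the manual control, instead of as a jumbled block of text.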
Next week, the validation process will continue alongside improvements to the AI model.
Adam