Finalized Dataset

Heet D -

In this post, I will explain domain adaptation and also describe my progress for the week. 

As promised, domain adaptation means taking a model trained on one domain (the source domain) and adapting it to perform well on a different domain (the target domain).

You can think of this as teaching a chef who specializes in Thai cuisine to cook Italian food. The chef already has the fundamental cooking skills, but may not know the nuances that separate Italian food from Thai food. In my case, CLIP is the chef: the source domain is the general images that CLIP was trained on, and the target domain is the medical chest X-ray images I will be using.

In theory, an adapted CLIP should handle the characteristics of medical images better, such as grayscale, anatomical structures, and subtle abnormalities. My results without domain adaptation were far too low, so I looked through other papers and code sources and found domain adaptation as a potential solution, which seems to have worked. And finally, the word is actually “adaptation”, not “adaption”, which I mistyped all over my last blog post (now fixed).
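To make the source/target idea concrete, here is a minimal toy sketch in plain Python. It is not CLIP and not the code from my project: it assumes a one-feature logistic classifier, made-up Gaussian data, and hypothetical helper names (`make_domain`, `train`, `accuracy`). The classifier is first trained on a "source" domain, then fine-tuned on a handful of labeled examples from a shifted "target" domain, and the accuracy on the target domain is measured before and after.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_domain(rng, n_per_class, shift):
    """Toy data: class 0 is centred at shift - 1, class 1 at shift + 1."""
    data = []
    for label in (0, 1):
        centre = shift + (1.0 if label == 1 else -1.0)
        data += [(rng.gauss(centre, 0.5), label) for _ in range(n_per_class)]
    return data


def train(w, b, data, lr, steps):
    """Full-batch gradient descent on the logistic loss, starting from (w, b)."""
    n = len(data)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, data):
    """Fraction of examples whose predicted class matches the label."""
    correct = sum((sigmoid(w * x + b) >= 0.5) == (y == 1) for x, y in data)
    return correct / len(data)


rng = random.Random(0)
source_train = make_domain(rng, 100, shift=0.0)  # plentiful source data
target_few = make_domain(rng, 10, shift=3.0)     # a few labeled target examples
target_test = make_domain(rng, 100, shift=3.0)   # held-out target data

# Step 1: train on the source domain only.
w, b = train(0.0, 0.0, source_train, lr=0.1, steps=500)
acc_before = accuracy(w, b, target_test)

# Step 2: adapt by continuing training from those weights on the target examples.
w, b = train(w, b, target_few, lr=0.1, steps=1000)
acc_after = accuracy(w, b, target_test)

print("target accuracy before adaptation:", acc_before)
print("target accuracy after adaptation:", acc_after)
```

The important detail is that step 2 starts from the weights learned in step 1 instead of starting from scratch, just as the chef keeps their existing cooking skills. Adapting a real model like CLIP involves far more parameters and design choices, but the two-stage structure is the same.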

This past week I made yet another change to the dataset I will be using for the rare disease part of my project, but I promise this is the last one.

As I mentioned previously, the dataset is often one of the most crucial parts of a machine learning project, so it is very important that I make the right choice. I was planning on using CheXpert, but I am now using the MIMIC-CXR dataset. In my earlier blog posts, I planned to use a similar MIMIC dataset for the common disease portion of my project, but I was under the impression that continuing to use it might violate the privacy policy. After looking into the matter further and reading other research papers that also used MIMIC datasets, I now believe my implementation falls within the guidelines, so I have decided to stick with this dataset.

I came across this MIMIC dataset through the paper by Holste et al., “Long-Tailed Classification of Thorax Diseases on Chest X-Ray: A New Benchmark Study”. The study combines the NIH Chest X-ray dataset (my common disease dataset) with the MIMIC-CXR dataset (my rare disease dataset), so it aligns really well with my own project. The paper further supports the idea that MIMIC-CXR is the best dataset for my needs.

Overall this week, I extracted the images for few-shot learning and have the data ready. The only parts left in my project are actually running these tests with the data and comparing all of my results with each other. I will go into more detail on this in my next post. Thank you!

 
