Common Disease Datasets
Heet D
This past week, I began the data collection part of my project. I originally planned to use the MIMIC dataset, which comprises real-world health data for over 40,000 patients who stayed in the critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. To protect patient privacy, all of this data was carefully curated and de-identified. Anyone who wants access to the dataset must first complete certain training requirements and sign privacy agreements. This is because MIMIC is built from electronic health records: it contains clinical notes, lab results, demographics, ICU stays, and any other information that hospitals collect alongside image data. These requirements are enforced so that patients’ data stays both private and confidential.
I was initially planning to use MIMIC since it would give me a good baseline to evaluate my model’s performance. However, while completing the training, I realized that my project could potentially violate the dataset’s privacy agreement. My project requires models like CLIP and GPT-4 to access the data for few-shot learning, which could be seen as sharing data with third parties, something the privacy policy prohibits. I didn’t want to risk violating the agreement, so I began searching for a new dataset and came across the NIH Chest X-rays dataset on Kaggle, which comprises 112,120 X-ray images with disease labels from 30,805 unique patients. Since this dataset is publicly available and the required de-identification steps have already been taken, using it with these models does not carry the same privacy risk. It is also a good alternative to MIMIC since it is a large dataset whose disease labels were text-mined from the associated radiology reports. My next step is to pre-process the data so I can train my models and establish a baseline. By the next blog, I aim to have pre-processing done and to begin testing my models on common disease data. Thank you!
