Common Disease Datasets

Heet D - March 8, 2025 7:27 am

This past week, I began the data collection part of my project. I originally planned to use the MIMIC dataset, which comprises real-world health data for over 40,000 patients who stayed in the critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. Because MIMIC is an electronic health record, it contains clinical notes, lab results, demographics, ICU stays, and essentially all of the other information that hospitals collect, alongside image data. To protect patient privacy, all of this data had to be carefully curated and de-identified, and anyone who wants access must complete certain training requirements and sign privacy agreements. These requirements are enforced to ensure that patients’ data stays both private and confidential.

I was initially planning to use the MIMIC dataset since it would give me a good baseline to evaluate my model’s performance. However, while completing the training, I realized that my project could potentially violate the dataset’s data privacy agreement. My project requires models like CLIP and GPT-4 to access the data for few-shot learning, and since sharing data with third parties is prohibited, this could be seen as violating the privacy policy. I didn’t want to risk violating the agreement, so I began searching for a new dataset and came across the NIH Chest X-rays dataset on Kaggle, which comprises 112,120 X-ray images with disease labels from 30,805 unique patients. Since this dataset is publicly available on Kaggle and the required de-identification steps have already been taken, it does not carry the same data-sharing restrictions as MIMIC. It is also a good alternative to MIMIC since it is a large dataset with disease labels derived from the associated radiology reports. As my next step, I plan to pre-process the data so that I can train my models and establish a baseline. By the next blog post, I aim to have finished pre-processing the data and begun testing my models on common disease data. Thank you!
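A minimal sketch of what one pre-processing step could look like, not necessarily the approach this project will take. With 112,120 images from 30,805 patients, many patients have more than one image, so splitting by patient rather than by image keeps the same person out of both the training and test sets. The column names ("Image Index", "Finding Labels", "Patient ID"), the "|" separator between labels, and the "No Finding" value are assumptions about the metadata file, and the sample rows are made up for illustration.

```python
import csv
import io
import random

# Made-up sample rows; the column names and "|" separator are assumptions
# about the dataset's metadata CSV, not taken from the post.
SAMPLE_CSV = """Image Index,Finding Labels,Patient ID
00000001_000.png,Cardiomegaly,1
00000001_001.png,Cardiomegaly|Emphysema,1
00000002_000.png,No Finding,2
00000003_000.png,Hernia,3
00000003_001.png,Hernia|Infiltration,3
00000004_000.png,No Finding,4
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def build_label_index(rows):
    """Map each disease name to a fixed position, ignoring 'No Finding'."""
    labels = set()
    for row in rows:
        for name in row["Finding Labels"].split("|"):
            if name != "No Finding":
                labels.add(name)
    return {name: i for i, name in enumerate(sorted(labels))}


def encode(row, label_index):
    """Turn a row's labels into a multi-hot vector (all zeros = no finding)."""
    vec = [0] * len(label_index)
    for name in row["Finding Labels"].split("|"):
        if name in label_index:
            vec[label_index[name]] = 1
    return vec


def split_by_patient(rows, test_fraction=0.25, seed=0):
    """Split rows so that every image of a given patient lands on one side."""
    patients = sorted({row["Patient ID"] for row in rows})
    random.Random(seed).shuffle(patients)
    n_test = max(1, int(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [r for r in rows if r["Patient ID"] not in test_patients]
    test = [r for r in rows if r["Patient ID"] in test_patients]
    return train, test


rows = load_rows(SAMPLE_CSV)
label_index = build_label_index(rows)
train_rows, test_rows = split_by_patient(rows)
```

Here `encode(rows[1], label_index)` produces a vector with a 1 for each of the two diseases listed on that image, and an image marked "No Finding" encodes as all zeros.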

Comments:


rohan_v

Hey Heet, I'm glad you found a new dataset to use that doesn't have any privacy concerns! Could using the NIH dataset instead of the MIMIC dataset limit the accuracy and performance of your model in any way?

March 12, 2025 at 5:02 pm

heet_d

Great question, Rohan! Since the MIMIC dataset is more of an electronic health record, it includes information like lab results and patient history that the NIH dataset does not have. This extra information could serve as additional context and allow for better generalization, so it is possible that MIMIC could lead to better performance. At the same time, the NIH dataset consists of high-quality labeled images, which are ideal for image analysis, and a lot of researchers use it for this reason. Each dataset has its own advantages and disadvantages, so the best choice depends on the requirements of the research. Ultimately, I think it's hard to tell whether there would be an impact on my results without testing both datasets first.

March 13, 2025 at 11:35 am
