Plunging In: Uncovering the Surface of Breast Cancer Data

Alisha J - February 22, 2025 9:26 pm

Welcome back to my blog!

This past week has been quite hectic as I delved deeper into the breast cancer data. I found myself constructing numerous tables in Excel, filled with hundreds of rows and columns. In addition, I’ve been comparing multiple datasets to identify their similarities and differences. During this process, I discovered a new tool in Excel—Power Query—that I had not previously utilized. This tool has enabled me to efficiently cross-compare various datasets and extract information from multiple sources, such as files and web sources.

However, I have encountered a challenge in the dataset comparison. Specifically, the comparison currently flags two kinds of mismatch: entirely different values (which I’d like the system to flag as discrepancies) and values recorded with varying measures of accuracy (which are also currently flagged in the process). I am actively working with my advisor to address this issue.
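One possible way to tell the two cases apart is sketched below in Python. This is only an illustration under stated assumptions, not the Power Query setup I am using: the function name `classify_pair` is hypothetical, the sample numbers are made up, and it assumes that "varying measures of accuracy" means the same number recorded with a different number of decimal places.

```python
from decimal import Decimal, InvalidOperation


def classify_pair(a, b):
    """Classify two cell values as 'match', 'precision', or 'discrepancy'.

    Assumption: two numbers differ only in precision if they agree once
    both are rounded to the smaller number of decimal places.
    """
    text_a, text_b = str(a).strip(), str(b).strip()
    if text_a == text_b:
        return "match"
    try:
        x, y = Decimal(text_a), Decimal(text_b)
    except InvalidOperation:
        # At least one value is non-numeric text, and the texts differ.
        return "discrepancy"
    if not (x.is_finite() and y.is_finite()):
        return "discrepancy"
    # Decimal places each value was recorded with (never below zero).
    places = max(min(-x.as_tuple().exponent, -y.as_tuple().exponent), 0)
    quantum = Decimal(1).scaleb(-places)
    if x.quantize(quantum) == y.quantize(quantum):
        return "precision"
    return "discrepancy"


# Made-up sample values, for illustration only.
print(classify_pair("0.5773", "0.577302"))  # same value, more decimals
print(classify_pair("0.5773", "0.6120"))    # genuinely different values
```

With a rule like this, only the pairs labelled "discrepancy" would need to be flagged for review, while "precision" pairs could be reported separately or ignored.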

Looking ahead to the upcoming week, I am focused on gaining access to .RAW files as I strive to understand how the authors transformed the original raw mass spectrometry data into the dataset provided on Kaggle: https://www.kaggle.com/datasets/piotrgrabo/breastcancerproteomes/data

Below are some screenshots showcasing the outcomes of my comparisons between two tables for both similarities and differences.

The first image shows a manual comparison I conducted in Excel for two rows. Each cell is flagged either because the values genuinely differ or because they are recorded with differing measures of accuracy, even when the underlying values are otherwise identical.

The second image displays the resulting table of differences when comparing the gene symbol and gene name between the two datasets.

The final image shows the resulting table of differences when analyzing the gene symbol, gene name, and accession number across the two datasets.
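For readers who prefer code to screenshots, the idea behind these difference tables can be sketched in Python. This is a rough sketch, not the Power Query steps I actually ran: the helper `key_differences`, the column names, and the placeholder rows are all hypothetical and do not come from the real datasets.

```python
def key_differences(rows_a, rows_b, keys):
    """Return the key combinations found in only one of the two tables."""
    def key_of(row):
        # Build a comparable tuple from the chosen columns.
        return tuple(str(row.get(k, "")).strip() for k in keys)

    keys_a = {key_of(r) for r in rows_a}
    keys_b = {key_of(r) for r in rows_b}
    return sorted(keys_a - keys_b), sorted(keys_b - keys_a)


# Placeholder rows with hypothetical column names.
table_a = [
    {"gene_symbol": "GENE_A", "gene_name": "placeholder A", "accession": "ACC_001"},
    {"gene_symbol": "GENE_B", "gene_name": "placeholder B", "accession": "ACC_002"},
]
table_b = [
    {"gene_symbol": "GENE_A", "gene_name": "placeholder A", "accession": "ACC_001"},
    {"gene_symbol": "GENE_B", "gene_name": "placeholder B", "accession": "ACC_999"},
]

# Comparing on two columns, then on three.
print(key_differences(table_a, table_b, ["gene_symbol", "gene_name"]))
print(key_differences(table_a, table_b, ["gene_symbol", "gene_name", "accession"]))
```

Adding a column to the key makes the comparison stricter, which is why a table of differences can grow when the accession number is included alongside the gene symbol and gene name.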

Thank you all for your continued support, and stay tuned for further updates in the coming months!

Best regards,
Alisha Jindal

Comments:


evangeline_c

It sounds like you’ve had a super productive week! Your process of comparing the datasets and finding discrepancies to flag is very interesting.

February 24, 2025 at 11:34 am

mikyle_h

Hi Alisha, it sounds like you're making great progress with your breast cancer dataset analysis! As you dive deeper into analyzing discrepancies between the datasets, how do you plan to handle situations where data mismatches might not just be due to measurement errors but potentially represent actual biological differences?

February 25, 2025 at 4:15 pm

alisha_j

Thank you for your question, Mikyle! It's certainly valid to consider the possibility of data mismatches arising from biological differences. However, based on the dataset and information I have, I cannot determine whether those differences exist. In this context, I am setting aside the potential issue of data inconsistencies due to actual biological variations. I hope this clarifies your concerns!

March 3, 2025 at 5:45 pm

aditya_l

Hi Alisha, thanks for sharing! Your datasets have nice labels that point to clear accession numbers. I am curious: do any of your datasets (or the database for which you are using accession numbers) contain expression values for these genes?

February 27, 2025 at 11:12 am

alisha_j

Great question, Aditya! The current datasets I am working with do include the expression values for the genes. I hope this clarifies your question!

March 3, 2025 at 11:14 pm
