
PROJECT-I PROPOSAL

Name: Bhaskar Moguthala Suid: L00179396


TITLE: LUNG CANCER PREDICTION
PROBLEM DESCRIPTION: I collected data on people with lung cancer,
each labeled with one of three levels (low, medium, or high) and
described by details such as age, gender, surroundings, and lifestyle.
My main aim is to build models that predict lung cancer risk from this
varied information. In doing so, I want to improve early detection of
lung cancer, determine which factors contribute the most, and deepen
the understanding of the disease. My goal is to help identify
potential health issues sooner so that action can be taken to improve
healthcare.
GOALS: My main goal is to understand the factors that cause lung
cancer, and I will address the following:
1. Predicting the probability of a patient developing lung cancer.
2. Recognizing factors that increase the risk of developing lung
cancer.
3. Deciding on the treatment that is most likely to be effective for
a patient diagnosed with lung cancer.
DESCRIPTION OF DATA: The lung cancer dataset contains diverse
attributes such as age, gender, environmental exposures, and lifestyle
factors. The presence of both categorical and numerical features
makes it rich in information. The inclusion of symptoms like coughing
up blood, chest pain, and weight loss adds clinical relevance.
Link for the source of the data:
https://www.kaggle.com/datasets/thedevastator/cancer-patients-and-air-pollution-a-new-link/data
CHALLENGES:
Missing Values: Handling missing data, especially in variables like
genetic risk or occupational hazards, may pose challenges in ensuring
the dataset's completeness.

Data Imbalance: Imbalances in the distribution of lung cancer cases
versus non-cases could affect model training, requiring techniques to
address this issue.

Categorical Variables: Proper encoding of categorical variables like
smoking status and dust allergy is essential for machine learning
models.

Data Quality: Ensuring accuracy and reliability of data, particularly in
self-reported variables such as smoking habits or environmental
exposures, is crucial.

Data Privacy: Dealing with sensitive health information requires
careful handling to maintain privacy and comply with ethical
standards.

Feature Scaling: Variables like age and air pollution exposure may
have different scales, necessitating appropriate scaling for certain
machine learning algorithms.

Data Preparation Difficulties: Obtaining a clean and well-structured
dataset may require addressing the challenges mentioned above through
imputation techniques, handling imbalances, and rigorous quality
checks.

Methodology:
Data Cleaning and Preprocessing:
Handle missing values: Impute or remove missing data based on the
nature of the variables.
Encode categorical variables: Convert categorical features into
numerical representations.
Normalize/Scale: Standardize numerical features to ensure
uniformity in scale.
Address data imbalance: Employ techniques such as oversampling or
undersampling to handle any class imbalance.
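
The four preprocessing steps above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the project's final pipeline: the toy records, the field names (age, air_pollution, smoking, level), and the helper functions are all hypothetical and do not reflect the actual columns of the Kaggle file.

```python
import random
import statistics
from collections import Counter

# Toy records; field names and values are illustrative assumptions only.
rows = [
    {"age": 33, "air_pollution": 2, "smoking": "no", "level": "Low"},
    {"age": 17, "air_pollution": 3, "smoking": "yes", "level": "Medium"},
    {"age": None, "air_pollution": 4, "smoking": "yes", "level": "High"},
    {"age": 52, "air_pollution": 7, "smoking": "no", "level": "High"},
    {"age": 46, "air_pollution": 6, "smoking": "yes", "level": "High"},
]

def impute_median(rows, key):
    """Replace missing (None) values of `key` with the column median."""
    median = statistics.median(r[key] for r in rows if r[key] is not None)
    return [{**r, key: median if r[key] is None else r[key]} for r in rows]

def encode(rows, key, mapping):
    """Map each categorical value of `key` to its integer code."""
    return [{**r, key: mapping[r[key]]} for r in rows]

def standardize(rows, key):
    """Rescale `key` to zero mean and unit (population) standard deviation."""
    values = [r[key] for r in rows]
    mean, std = statistics.fmean(values), statistics.pstdev(values)
    return [{**r, key: (r[key] - mean) / std} for r in rows]

def oversample(rows, label_key, seed=0):
    """Randomly duplicate minority-class rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(r[label_key] for r in rows)
    target = max(counts.values())
    balanced = list(rows)
    for label, count in counts.items():
        pool = [r for r in rows if r[label_key] == label]
        balanced.extend(rng.choice(pool) for _ in range(target - count))
    return balanced

clean = impute_median(rows, "age")
clean = encode(clean, "smoking", {"no": 0, "yes": 1})
clean = encode(clean, "level", {"Low": 0, "Medium": 1, "High": 2})
clean = standardize(clean, "age")
clean = standardize(clean, "air_pollution")
balanced = oversample(clean, "level")
```

In practice, scaling statistics and oversampling should be computed on the training split only, so that no information from the test set leaks into training.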
Exploratory Data Analysis (EDA):
• Explore data distributions and correlations among variables.
• Identify potential outliers and anomalies.
• Conduct statistical analysis to understand the characteristics of
the dataset.
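
As a rough sketch of two of these EDA checks, the snippet below computes a Pearson correlation and flags outliers with the 1.5 x IQR rule. The values are made up for illustration, and the variable names (air_pollution, level_code, ages) are assumptions rather than columns taken from the dataset.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first/third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

# Illustrative values only (not taken from the dataset).
air_pollution = [1, 2, 3, 4, 5, 6, 7, 8]
level_code = [0, 0, 0, 1, 1, 2, 2, 2]      # 0 = low, 1 = medium, 2 = high
ages = [44, 46, 45, 47, 46, 45, 44, 46, 47, 129]

r = pearson(air_pollution, level_code)      # strength of linear association
suspect_ages = iqr_outliers(ages)           # candidates for manual review
```

An implausible value such as an age of 129 would be flagged here and could then be checked against the source records before deciding whether to correct or drop it.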
Feature Selection:
• Utilize statistical methods or machine learning algorithms to
identify and select relevant features.
• Consider domain knowledge and medical expertise to prioritize
significant predictors.
Model Development:
• Split the dataset into training and testing sets.
• Implement at least two predictive models.
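
Because the target has three levels, the split should preferably keep the class proportions similar in both sets. The sketch below shows one way to do a stratified split by hand; the helper name stratified_split, the 80/20 ratio, and the toy label counts are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with similar class proportions in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one test example per class.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels; the class counts are made up.
labels = ["Low"] * 30 + ["Medium"] * 30 + ["High"] * 40
train_idx, test_idx = stratified_split(labels)
```

scikit-learn's train_test_split offers the same behavior through its stratify argument, which is likely the more convenient option in the actual project.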
Model Training and Evaluation:
• Train the selected models on the training set.
• Evaluate model performance using metrics such as accuracy,
precision, recall, and F1-score.
• Fine-tune hyperparameters to optimize model performance.
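
To make the evaluation metrics concrete, the following sketch computes accuracy together with per-class precision, recall, and F1-score (one-vs-rest) and a macro-averaged F1. The true and predicted labels are invented for the example, and per_class_metrics is a hypothetical helper name.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed one-vs-rest."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics

# Invented labels for illustration.
y_true = ["Low", "Low", "Medium", "Medium", "High", "High", "High", "High"]
y_pred = ["Low", "Medium", "Medium", "Medium", "High", "High", "High", "Medium"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)
```

For a medical screening task, recall on the high level is arguably the figure to watch most closely, since a missed high-risk patient is usually costlier than a false alarm.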
Validation and Testing:
• Validate models on a separate validation set to ensure
generalizability.
• Assess the models' performance on the testing set to simulate
real-world scenarios.
Interpretability and Explanation:
• Provide interpretability for the developed models, especially
important for healthcare applications.
• Explain the contribution of different features in predicting lung
cancer risk.
Model Deployment:
Deploy the final model for real-world use, if deemed appropriate and
ethical.
Implement monitoring mechanisms to track model performance over
time.

Model Selection:
Logistic Regression:
Interpretability: Logistic regression provides a clear
interpretation of coefficients, facilitating the identification of key
predictors.
Simplicity: Its simplicity makes it easier to understand and explain to
healthcare professionals and stakeholders.
Random Forest Classifier:
Complex Relationships: Random Forest excels at capturing complex
relationships and interactions among variables, which can be crucial
in a dataset with diverse predictors.
Robustness: Because it averages many decision trees, Random Forest is
less prone to overfitting than a single tree, making it suitable for
datasets with varying degrees of predictor importance and noise.
These models, when used together, can offer a balanced approach that
combines interpretability with the ability to capture intricate
patterns, supporting a robust lung cancer prediction system.
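
A minimal sketch of how the two models could be trained side by side with scikit-learn is given below. It runs on synthetic data from make_classification as a stand-in for the patient table, so the feature count, sample size, and hyperparameters (max_iter, n_estimators) are assumptions and the resulting scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three classes mimic the low/medium/high levels.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    # Scaling matters for logistic regression, so it sits in a pipeline.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "macro_f1": f1_score(y_test, y_pred, average="macro"),
    }

# Interpretability hooks: one coefficient row per class for logistic
# regression, and one importance value per feature for the forest.
coefficients = models["logistic_regression"][-1].coef_
importances = models["random_forest"].feature_importances_
```

Comparing the largest coefficients with the highest feature importances gives a quick check on whether both models point to the same predictors.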
Bibliography:
Das, N. B. a. S., n.d. IEEE Xplore. [Online]
Available at: https://ieeexplore.ieee.org/document/9132913
