
ANALYSIS ON OLYMPIC DATASET

Using Logistic Regression and Random Forest Algorithm

Submitted in partial fulfillment of the

requirements for the award of the degree of

Bachelor of Computer Applications

To

Guru Gobind Singh Indraprastha University, Delhi

Guide:
Mr. Himanshu Pabbi (Assistant Professor)
Ms. Suman Singh (Assistant Professor)

Submitted by:
Sakshi Ujjlayan (BCA-V), 35413702021
Nivedita Agarwal (BCA-V), 09413702021

Institute of Information Technology & Management


New Delhi - 110058
Batch (2021-2024)

CERTIFICATE

We, Nivedita Agarwal (09413702021) and Sakshi Ujjlayan (35413702021), certify that the
Summer Training Project Report (BCA-331) entitled "Analysis on 120 years of
Olympic Dataset Using Logistic Regression & Random Forest Algorithm" is done by
us and is an authentic work carried out by us at the Institute of Information Technology
& Management. The matter embodied in this project work has not been submitted
earlier for the award of any degree or diploma, to the best of our knowledge and belief.

Signature of the Student Signature of the Student

Date:

Certified that the Project Report (BCA-331) entitled "120 years of Olympics Dataset
Using Data Science Algorithms", done by the above students, is completed under our
guidance.
Signature of the Guide:

Date:

Name of the Guide: Mr. Himanshu Pabbi

Ms. Suman Singh

Designation: Assistant Professor

Prof.(Dr.) Sudhir Kumar Sharma Prof. (Dr.) Rachita Rana

Counter sign HOD- Computer Science Counter sign Director


Acknowledgement

We would like to express our sincere gratitude to everyone who played an important
role in the successful completion of our project.

We thank our dedicated project guides, Mr. Himanshu Pabbi and Ms. Suman Singh, whose
guidance, support, and invaluable insights were instrumental in shaping this project. Your
unwavering commitment to excellence and your willingness to share your knowledge
have been truly inspiring.

We thank our esteemed summer training teacher, Dr. Prateek Gupta, whose expertise and
mentorship during the training period enriched our understanding of the subject
matter. Your encouragement and constructive feedback were invaluable in honing our
skills.

We thank our esteemed Head of Department, Prof. (Dr.) Sudhir Kumar Sharma, whose leadership
and vision have created an environment conducive to learning and innovation. Your
support for academic endeavors has been a constant source of motivation.

We are also grateful to all the faculty members, friends, and family who supported us
throughout this journey.

This project would not have been possible without the collective wisdom and
encouragement of these individuals. We thank each one of you from the bottom of our hearts
for your contributions.

Nivedita Agarwal (BCA-V), 09413702021

Sakshi Ujjlayan (BCA-V), 35413702021

ABSTRACT

This study applies logistic regression and random forest analysis to 120 years of
Olympic Games data, spanning 1896 to 2016. Using these techniques, we aim to predict
medal-winning probabilities for countries while considering factors such as population
size, host country advantage, and historical performance. The analysis provides insight
into the interplay of factors that influence a nation's performance in the world's
most prestigious sporting event and sheds light on the changing landscape of global
sports dominance. The best model obtained in the project achieves an accuracy of
0.89.

TABLE OF CONTENTS

S. No. TOPIC

1. CERTIFICATE
2. ACKNOWLEDGEMENT
3. ABSTRACT
4. SYNOPSIS
5. CHAPTER 1 – INTRODUCTION
   1.1 Description of the topic
   1.2 Problem Statement
   1.3 Objectives
   1.4 Scope of the Project
   1.5 Project planning Activities
       1.5.1 Team-Member wise work distribution table
       1.5.2 PERT Chart
       1.5.3
   1.6 Organization of the report
6. CHAPTER 2 – LITERATURE REVIEW
7. CHAPTER 3 – SYSTEM DESIGN AND METHODOLOGY
   3.1 System Design
   3.2 Algorithm Used
8. CHAPTER 4 – IMPLEMENTATION & RESULT
   4.1 Hardware and Software Requirement
   4.2 Implementation Details
   4.3 Results
9. CHAPTER 5 – CONCLUSION AND FUTURE WORK
   5.1 Conclusion
   5.2 Future Scope
10. REFERENCES

Synopsis

1. Title of the Project

Title: 120 years of Olympics Dataset Using Data Science Algorithms

2. Statement about the Problem

The questions posed by the Olympic 120 Years Dataset are complex and fascinating,
providing a unique opportunity to analyze and understand more than a century of Olympic
history. Researchers and data scientists can explore a wide range of questions and
challenges within this dataset, including predictive modeling and ethical considerations.
By leveraging this rich dataset, we can uncover historical trends, make predictions about
future Olympic events, and gain valuable insight into the evolution of sports, athletes,
and countries' performance on the world stage.

3. Significance of the Project

The project involving the application of logistic regression to the "120 Years of
Olympics" dataset holds significant importance as it combines the power of data analytics
with the historical legacy of the Olympic Games. By employing logistic regression,
researchers can unravel intricate patterns and relationships within this extensive dataset,
particularly in the context of predicting medal outcomes. Logistic regression can uncover
factors that influence an athlete's likelihood of winning a medal, providing insights into the
nuanced dynamics of the Olympics.

4. Objective
The objectives of analyzing the "120 Years of Olympics" dataset using logistic regression
and Random Forest are to predict future medal winners, understand historical trends in
Olympic performance, assess the influence of athlete demographics, evaluate the impact
of hosting the Olympics, promote fairness and inclusivity, allocate resources effectively,
and uncover insights into the evolution of Olympic participation.
5. Scope
The scope of this project encompasses the following aspects:

• Data collection from diverse sports categories.
• Preprocessing and cleaning of the collected data.
• Development and training of a logistic regression model and a random forest model.
• Comparative analysis of performance across different sports categories.
• Presentation of results and insights.

6. Hardware and Software Specification

Hardware Specifications

Minimum Hardware Requirements

Processor         Intel(R) Core(TM) i5 or equivalent
CPU               1.60 GHz
Memory            At least 2.00 GB
Hard Disk         500 GB
Display           Super VGA (1366 × 768) or higher resolution monitor
Input Devices     Keyboard, Mouse

Software Specifications

Minimum Software Requirements

Frontend          Python
Browser           Mozilla Firefox, Google Chrome, etc.
Development tool  Jupyter Notebook, Google Colab, Anaconda

6. Data Collection and Methodology

Data collection is one of the most important aspects of data analysis. The dataset can be
taken from sources such as www.kaggle.com and www.brightdata.com. Despite the wide
adoption of machine learning models, simply having a large dataset for a domain-specific
task does not ensure superior performance. Because machine learning models learn from the
data they are trained with, their automatic predictions are likely to mirror the human
disagreement identified during annotation. The dataset must therefore be properly cleaned
and preprocessed before training.
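
The cleaning steps described above can be sketched as follows. This is a minimal, dependency-free illustration and not the project's actual pipeline: the column names (`Name`, `Sex`, `Age`, `Medal`), the `NA` marker for missing values, the median imputation for age, and the helper names `load_rows` and `clean` are all assumptions made for the example.

```python
import csv
import io
from statistics import median

# Tiny in-memory sample. Column names and the "NA" missing-value marker are
# assumptions modeled on a typical athlete-level CSV, not taken from the report.
SAMPLE_CSV = """Name,Sex,Age,Medal
A,M,24,Gold
B,F,NA,NA
C,F,30,NA
A,M,24,Gold
D,M,27,Bronze
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Drop exact duplicates, impute missing ages, encode Sex, add a binary target."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row.items())
        if key not in seen:          # keep only the first copy of each row
            seen.add(key)
            unique.append(dict(row))
    known_ages = [float(r["Age"]) for r in unique if r["Age"] != "NA"]
    fill = median(known_ages)        # assumed imputation strategy: median age
    for r in unique:
        r["Age"] = float(r["Age"]) if r["Age"] != "NA" else fill
        r["Sex"] = 1 if r["Sex"] == "M" else 0          # categorical -> numeric
        r["won_medal"] = 0 if r["Medal"] == "NA" else 1  # medal vs. no medal
    return unique

cleaned = clean(load_rows(SAMPLE_CSV))
for r in cleaned:
    print(r["Name"], r["Sex"], r["Age"], r["won_medal"])
```

In practice the same steps would usually be written with a data-frame library; the plain-Python version is used here only to keep the example self-contained.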

7. Algorithm

The algorithm for analyzing the "120 Years of Olympics" dataset using logistic regression
begins with data preprocessing, which involves cleaning the dataset, encoding categorical
variables, and defining a binary target variable (medal or no medal). Next, the dataset is
split into training and testing sets so that the model can be fit on one part and evaluated
on the other. Random forest models are also built to predict medal outcomes using an
ensemble of decision trees, with their hyperparameters fine-tuned. The algorithm will be
explained in detail, including the mathematical foundation and implementation.
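
As a rough illustration of the split, fit, and evaluate steps for the logistic regression part, the sketch below trains a model by batch gradient descent on the mean log-loss, using the logistic function sigma(z) = 1 / (1 + e^(-z)). It is not the project's implementation: the data is synthetic, and the function names, learning rate, epoch count, and 75/25 split are assumptions chosen for the example. The random forest step is not covered here.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_test_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle indices reproducibly and split into train and test parts."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_fraction))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Gradient of the log-loss for one sample is (p - y) * x.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b, threshold=0.5):
    """Label a row 1 (medal) when its predicted probability reaches the threshold."""
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold else 0
            for xi in X]

def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic stand-in data: one feature, label 1 when the feature is positive.
rng = random.Random(42)
X = [[rng.uniform(-1, 1)] for _ in range(200)]
y = [1 if row[0] > 0 else 0 for row in X]

X_train, y_train, X_test, y_test = train_test_split(X, y)
w, b = fit_logistic(X_train, y_train)
test_accuracy = accuracy(y_test, predict(X_test, w, b))
print("test accuracy:", test_accuracy)
```

Evaluating on the held-out test rows, rather than on the rows used for fitting, is what makes the reported accuracy an estimate of performance on unseen data.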

8. Limitations of the Project

• Possible issues with missing or incomplete historical records.
• Risk of overfitting when a complex model is fit to limited data points.
• May not account for external factors such as technological advancements or
geopolitical events.
• Logistic regression assumes a linear relationship between the features and the
log-odds of the outcome, which may not hold in all cases.
• Limited ability to capture nonlinear relationships between variables.
• Logistic regression might not handle highly complex interactions between
features effectively.
• Olympic contexts may change, requiring continuous model updates to
maintain accuracy and relevance.

9. Conclusion and Future Scope for Modification

In conclusion, this project aims to provide researchers and analysts with an improved
methodology for medal prediction using logistic regression. The project's limitations will
be acknowledged. The future scope includes enhancing data collection methods, exploring
advanced machine learning techniques, and implementing real-time data integration for
more dynamic predictions.

10. References

All sources and references used in this project documentation will be listed in accordance
with the chosen citation style to ensure zero plagiarism.
