
EXPERIMENT-2

Problem: Linear regression on a random data set.


The training set should be used to build your machine learning model. For the training set, the
outcome (also known as the “ground truth”) y is given for each value of the feature x.
The test set should be used to see how well your model performs on unseen data. For each x in
the test set, use the model you trained to predict y, then compare the predictions with the y
values in the test set.

Steps:
Step 1: Import the relevant Python libraries for the analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from scipy import stats as st
import math

Step 2: Load the train and test datasets and clean them
dirty_training_set = pd.read_csv('train.csv')
dirty_test_set = pd.read_csv('test.csv')
# Drop every row that contains a missing value
training_set = dirty_training_set.dropna()
test_set = dirty_test_set.dropna()
# len() counts rows; DataFrame.size would count rows * columns
print("Rows before clean: ", len(dirty_training_set), "\n")
print("Rows after clean: ", len(training_set), "\n")
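
The row counts in Step 2 depend on how the rows are counted. As a minimal sketch, assuming a small made-up DataFrame in place of train.csv (the values below are hypothetical), the difference between the number of rows and DataFrame.size looks like this:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv: columns x and y, one missing y value
demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                     "y": [1.1, np.nan, 2.9, 4.2]})

# dropna() removes every row that contains at least one missing value
clean = demo.dropna()

rows_before = len(demo)    # number of rows
rows_after = len(clean)    # number of rows left after cleaning
cells_before = demo.size   # rows * columns, not the number of rows

print("Rows before clean:", rows_before)
print("Rows after clean:", rows_after)
print("Cells before clean (DataFrame.size):", cells_before)
```

Here len() reports 4 rows before and 3 rows after cleaning, while size reports 8 because it multiplies rows by columns.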

Step 3: Find mean, median and standard deviation
# DataFrame.as_matrix() was removed in pandas 1.0; select the column and call to_numpy()
x_training_set = training_set[['x']].to_numpy()
y_training_set = training_set[['y']].to_numpy()
x_test_set = test_set[['x']].to_numpy()
y_test_set = test_set[['y']].to_numpy()
print("Mean of X Training set: ", np.mean(x_training_set), "\n")
print("Median of X Training set: ", np.median(x_training_set), "\n")
print("Mean of Y Training set: ", np.mean(y_training_set), "\n")
print("Median of Y Training set: ", np.median(y_training_set), "\n")
print("Std Dev of X Training set: ", np.std(x_training_set), "\n")
Output:

Step 4: Find the relationship between variables

# Scatter plot of Y against X
plt.title('Relationship between X and Y')
plt.scatter(x_training_set, y_training_set, color='black')
plt.show()

# Histograms of X and Y side by side
plt.subplot(1, 2, 1)
plt.title('X training set')
plt.hist(x_training_set)
plt.subplot(1, 2, 2)
plt.title('Y training set')
plt.hist(y_training_set)
plt.show()

# Box plots of X and Y side by side
plt.subplot(1, 2, 1)
plt.title('X training set')
plt.boxplot(x_training_set)
plt.subplot(1, 2, 2)
plt.title('Y training set')
plt.boxplot(y_training_set)
plt.show()
Output:

Step 5: Set up the linear regression model
lm = linear_model.LinearRegression()
lm.fit(x_training_set, y_training_set)
print('R sq: ', lm.score(x_training_set, y_training_set))
# sqrt(R sq) gives only the magnitude of the correlation; its sign is the sign of lm.coef_
print('Correlation: ', math.sqrt(lm.score(x_training_set, y_training_set)))
Output:
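
As a cross-check on Step 5, R sq can be recomputed by hand as 1 - SS_res / SS_tot. The sketch below uses made-up data with a downward trend (the numbers are hypothetical, not the lab data set) to show that the manual value matches lm.score, and that the square root of R sq recovers only the size of the correlation, not its sign.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical data: y falls as x rises, plus a small fixed disturbance
x = np.arange(10, dtype=float).reshape(-1, 1)
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1])
y = (20.0 - 2.0 * x.ravel() + noise).reshape(-1, 1)

lm = linear_model.LinearRegression()
lm.fit(x, y)

# R sq = 1 - (residual sum of squares) / (total sum of squares)
residuals = y - lm.predict(x)
ss_res = float(np.sum(residuals ** 2))
ss_tot = float(np.sum((y - y.mean()) ** 2))
r_squared_manual = 1.0 - ss_res / ss_tot

# Pearson correlation keeps the sign that sqrt(R sq) loses
r = np.corrcoef(x.ravel(), y.ravel())[0, 1]

print("R sq from lm.score:", lm.score(x, y))
print("R sq by hand:", r_squared_manual)
print("sqrt(R sq):", np.sqrt(r_squared_manual))
print("Pearson r:", r)
```

For a single feature with an intercept, R sq equals the square of Pearson's r, so the two agree in magnitude; here r is negative while sqrt(R sq) is positive.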

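Step 6: Find the coefficient
print("Coefficient for X: ", lm.coef_)
# Note: st.sem(x) is the standard error of the mean of x, not the standard error of the slope
print("Standard Error: ", st.sem(x_training_set))
ttest = lm.coef_ / st.sem(x_training_set)
print("The t-statistic:", ttest)
# pearsonr expects 1-D arrays and returns (correlation, two-tailed p-value)
print("Two tailed p-values: ", st.pearsonr(x_training_set.ravel(), y_training_set.ravel()))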

Output:
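
The t-statistic for a regression coefficient is the coefficient divided by its own standard error, which is a different quantity from st.sem(x), the standard error of the mean of x. One way to obtain the slope's standard error is scipy.stats.linregress; this is a swapped-in technique, not the method used in Step 6, and the data below are hypothetical.

```python
import numpy as np
from scipy import stats as st

# Hypothetical 1-D data, roughly y = 2x (not the lab data set)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

# linregress returns slope, intercept, rvalue, pvalue and stderr (of the slope)
result = st.linregress(x, y)

# t-statistic for the null hypothesis that the slope is zero
t_statistic = result.slope / result.stderr

# Two-tailed p-value from the t distribution with n - 2 degrees of freedom
degrees_of_freedom = len(x) - 2
p_manual = 2 * st.t.sf(abs(t_statistic), degrees_of_freedom)

print("Slope:", result.slope)
print("Standard error of the slope:", result.stderr)
print("t-statistic:", t_statistic)
print("Two-tailed p-value (linregress):", result.pvalue)
print("Two-tailed p-value (by hand):", p_manual)
```

The p-value computed by hand from the t-statistic should agree with result.pvalue, which makes this a convenient consistency check.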

Step 7: Test the trained model
# Predict y for the unseen x values in the test set
y_predicted = lm.predict(x_test_set)

# Points close to a straight diagonal line indicate accurate predictions
plt.title('Comparison of Y values in test and the Predicted values')
plt.ylabel('Test Set')
plt.xlabel('Predicted values')
plt.scatter(y_predicted, y_test_set, color='black')
plt.show()

Output:
Conclusion:
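In the Step 7 comparison, the predicted values line up closely with the test set values, so, as we expected, the linear model is a really good fit.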
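
The quality of the fit can also be expressed as numbers rather than judged from the plot alone. The sketch below computes the root mean squared error and R sq on a test set; the data are generated on the spot under the assumption that y is roughly equal to x plus noise, so they are a hypothetical stand-in for train.csv and test.csv, and the metric names come from sklearn.metrics rather than from the steps above.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical stand-in data: y = x + Gaussian noise (an assumption)
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 100, size=(200, 1))
y_train = x_train + rng.normal(0, 3, size=(200, 1))
x_test = rng.uniform(0, 100, size=(50, 1))
y_test = x_test + rng.normal(0, 3, size=(50, 1))

# Fit on the training set only, then predict the unseen test set
lm = linear_model.LinearRegression().fit(x_train, y_train)
y_predicted = lm.predict(x_test)

# RMSE is in the units of y; R sq is the share of variance explained
rmse = float(np.sqrt(mean_squared_error(y_test, y_predicted)))
r2_test = r2_score(y_test, y_predicted)

print("Test RMSE:", rmse)
print("Test R sq:", r2_test)
```

A test R sq close to 1 together with an RMSE that is small relative to the spread of y supports the claim of a good fit.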
