
Here's a detailed training course outline for Python tailored to data analysis at a moderate difficulty level, including prerequisites, target audience, module-by-module topics, course materials, and an in-depth syllabus with sample code:

Course Title: Python for Data Analysis

Course Duration: Approximately 16 weeks in total across eight modules (flexible based on the pace of learning)

Prerequisites: Participants should have a basic understanding of Python programming and fundamental data concepts.

Target Audience:
● Data analysts
● Business analysts
● Data scientists and machine learning engineers (MLEs)
● Anyone looking to use Python for data analysis or work in one of the above roles

Course Outline:

● Module 1: Introduction to Python and Data Analysis

❖ Duration: 1 week
❖ Topics:
1. Overview of Python and its role in data analysis
2. Setting up the Python environment (e.g., Anaconda)
3. Basic Python programming concepts (variables, data types, loops, functions)
4. Introduction to Jupyter Notebooks for interactive coding

● Module 2: Data Manipulation with Pandas


❖ Duration: 2 weeks
❖ Topics:
1. Introduction to Pandas for data manipulation
2. Data structures: Series and DataFrame
3. Data cleaning and preprocessing techniques
4. Indexing and slicing data
5. Handling missing data and duplicates
6. Merging and joining datasets

● Module 3: Data Visualization with Matplotlib and Seaborn


❖ Duration: 2 weeks
❖ Topics:
1. Data visualization principles and best practices
2. Introduction to Matplotlib for creating basic plots
3. Advanced plotting techniques and customization
4. Introduction to Seaborn for statistical data visualization
5. Creating interactive visualizations with Plotly

● Module 4: Exploratory Data Analysis (EDA)


❖ Duration: 2 weeks
❖ Topics:
1. Importance of EDA in data analysis
2. Descriptive statistics and summary metrics
3. Data distribution analysis
4. Visualizing relationships between variables
5. Detecting and handling outliers
6. Hypothesis testing for initial insights

● Module 5: Statistical Analysis with SciPy and Statsmodels


❖ Duration: 2 weeks
❖ Topics:
1. Introduction to statistical analysis concepts
2. Hypothesis testing (t-tests, ANOVA, Chi-Square)
3. Regression analysis (linear and logistic regression)
4. Time series analysis and forecasting
5. Interpretation of statistical results

● Module 6: Machine Learning Fundamentals with Scikit-Learn


❖ Duration: 3 weeks
❖ Topics:
1. Introduction to machine learning and its applications
2. Supervised and unsupervised learning
3. Data preprocessing and feature engineering
4. Classification and regression algorithms (Decision Trees, Random Forest, K-Nearest Neighbors, etc.)
5. Model evaluation and selection
6. Introduction to cross-validation and hyperparameter tuning

● Module 7: Data Wrangling and Advanced Topics


❖ Duration: 2 weeks
❖ Topics:
1. Advanced data cleaning and transformation techniques
2. Feature selection and engineering strategies
3. Handling categorical data and imbalanced datasets
4. Dimensionality reduction techniques (PCA, t-SNE)
5. Introduction to natural language processing (NLP) for text analysis

● Module 8: Final Data Analysis Project


❖ Duration: 2 weeks
❖ Topics:
1. Participants work on a real-world data analysis project
2. Project includes data exploration, hypothesis testing, machine learning,
and data visualization
3. Regular feedback and assistance from instructors

Course Materials:

Textbook: "Python for Data Analysis" by Wes McKinney

This course structure provides a solid foundation in Python tailored to data analysis needs,
covering both intermediate and advanced topics. It equips participants with practical Python
data analysis skills and prepares them to tackle real-world data analysis challenges.

Syllabus Focus (in-depth)

Module 1: Introduction to Python and Data Analysis

Here's a brief explanation of each topic in Module 1: Introduction to Python and Data
Analysis, along with sample Python code for each topic:

1. Overview of Python and its role in data analysis

Explanation: This topic introduces Python and its significance in data analysis. Python is
a versatile programming language with a vast ecosystem of libraries that are widely
used for data manipulation, analysis, and visualization.

Sample Code:

```python
# Sample code demonstrating Python's versatility
print("Hello, Python!")

# Python code for calculating the sum of two numbers
a = 5
b = 3
sum_result = a + b
print("The sum of", a, "and", b, "is:", sum_result)

```

2. Setting up the Python environment (e.g., Anaconda)

Explanation: Here, you'll learn how to set up your Python environment using Anaconda,
a popular distribution that includes essential libraries and tools for data analysis.

Sample Code:
Installation of Anaconda is typically done through the Anaconda Navigator or the command line, so there is no installation code for this topic. The short sketch below shows one quick way to verify the environment afterward.
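
A minimal sanity check, assuming Anaconda is already installed (the exact paths and version numbers will vary by machine):

```python
# Minimal sketch: verify the Anaconda environment is active
# (paths and versions below are machine-specific)
import sys
print(sys.executable)  # should point inside your Anaconda installation

# Confirm the core data analysis libraries are importable
import numpy
import pandas
import matplotlib
print("numpy:", numpy.__version__)
print("pandas:", pandas.__version__)
print("matplotlib:", matplotlib.__version__)
```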

3. Basic Python programming concepts (variables, data types, loops, functions)

Explanation: This topic covers fundamental Python programming concepts, including variables, data types (such as integers, strings, and booleans), loops (for and while loops), and functions for code reusability.

Sample Code:

```python
# Sample code demonstrating basic Python concepts
name = "John"        # Variable
age = 30             # Variable
is_student = True    # Variable

for i in range(5):   # Loop
    print("Iteration", i)

def greet(name):     # Function
    return "Hello, " + name + "!"

message = greet(name)
print(message)
```

4. Introduction to Jupyter Notebooks for interactive coding

Explanation: Jupyter Notebooks provide an interactive coding environment where you can combine code, text, and visualizations. They are commonly used in data analysis for creating reproducible analyses.

Sample Code:

Open a Jupyter Notebook and add the following code to a cell:

```python
# Sample code in a Jupyter Notebook cell
print("Welcome to Jupyter Notebook!")
```

Execute the cell, and you'll see the output below the cell.

These sample code snippets illustrate the key concepts of Python and the use of
Jupyter Notebooks for interactive coding. They provide a hands-on introduction to
Python and set the foundation for further exploration in data analysis.

Module 2: Data Manipulation with Pandas

Let's explain each of the topics in Module 2: Data Manipulation with Pandas and provide
sample Python code for each topic.

1. Introduction to Pandas for data manipulation

Explanation: This topic introduces Pandas, a popular Python library used for data
manipulation and analysis. Pandas provides data structures and functions to work with
structured data efficiently.

Sample Code:

```python
# Sample code demonstrating Pandas basics
import pandas as pd

# Create a simple DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22]}
df = pd.DataFrame(data)
print(df)
```

2. Data structures: Series and DataFrame

Explanation: Pandas offers two primary data structures: Series (1D) and DataFrame
(2D). Series is ideal for working with single columns, while DataFrame is used for
tabular data with rows and columns.

Sample Code:

```python
# Sample code demonstrating Series and DataFrame
import pandas as pd

# Create a Series
series = pd.Series([10, 20, 30, 40])
print(series)

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22]}
df = pd.DataFrame(data)
print(df)
```

3. Data cleaning and preprocessing techniques

Explanation: This topic covers techniques for cleaning and preprocessing data,
including handling missing values, converting data types, and removing outliers.

Sample Code:

```python
# Sample code demonstrating data cleaning and preprocessing
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, None, 22]}  # Missing value
df = pd.DataFrame(data)

# Fill missing values with the column mean
# (assignment avoids pandas chained-assignment issues with inplace=True)
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
```

4. Indexing and slicing data

Explanation: Indexing and slicing in Pandas allow you to select specific rows or columns
from a DataFrame using labels or integer-based indexing.

Sample Code:

```python
# Sample code demonstrating indexing and slicing
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22]}
df = pd.DataFrame(data, index=['A', 'B', 'C'])

# Select a specific row by label
row_b = df.loc['B']

# Select a specific column
ages = df['Age']

print("Row B:\n", row_b)
print("Ages:\n", ages)
```

5. Handling missing data and duplicates

Explanation: This topic explores strategies for handling missing data and duplicates within a dataset, ensuring data quality.

Sample Code:

```python
# Sample code demonstrating handling missing data and duplicates
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
        'Age': [25, None, 22, 25]}
df = pd.DataFrame(data)

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Fill missing values with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
```

6. Merging and joining datasets

Explanation: This topic covers techniques for combining multiple datasets using Pandas'
merge and join operations.

Sample Code:

```python
# Sample code demonstrating merging and joining datasets
import pandas as pd

data1 = {'ID': [1, 2, 3],
         'Name': ['Alice', 'Bob', 'Charlie']}
data2 = {'ID': [2, 3, 4],
         'Age': [25, 22, 30]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Merge datasets on the 'ID' column (inner join keeps matching rows only)
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)
```

These sample code snippets illustrate key Pandas concepts and techniques for data
manipulation, including creating DataFrames, cleaning and preprocessing data,
indexing, handling missing values, removing duplicates, and merging datasets. These
skills are foundational for data analysis tasks in subsequent modules.

Module 3: Data Visualization with Matplotlib and Seaborn

Let's explain each of the topics in Module 3: Data Visualization with Matplotlib and
Seaborn and provide sample Python code for each topic.

1. Data visualization principles and best practices

Explanation: This topic introduces the principles and best practices of data visualization,
including understanding the importance of visualizing data effectively to convey insights.

Sample Code:

This topic typically involves discussions and examples of good and bad data visualization practices, so there is no single code snippet associated with it. The short sketch below illustrates a few of the practices in action.
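
A minimal sketch of two common best practices (descriptive labels and a zero-based value axis); the data here is invented purely for illustration:

```python
# Minimal sketch: basic visualization best practices (illustrative data)
import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']
sales = [120, 135, 128, 150, 162]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(months, sales, marker='o')

# Best practices: label both axes, add a descriptive title,
# and start the value axis at zero to avoid exaggerating changes
ax.set_xlabel('Month')
ax.set_ylabel('Sales (units)')
ax.set_title('Monthly Sales, January to May')
ax.set_ylim(bottom=0)

plt.tight_layout()
plt.show()
```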

2. Introduction to Matplotlib for creating basic plots

Explanation: Matplotlib is a popular Python library for creating static, non-interactive visualizations. In this topic, you'll learn the basics of Matplotlib and how to create fundamental plots.

Sample Code:

```python
# Sample code demonstrating basic Matplotlib plotting
import matplotlib.pyplot as plt

# Create a simple line plot
x = [1, 2, 3, 4, 5]
y = [10, 15, 7, 12, 9]

plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot')
plt.show()
```

3. Advanced plotting techniques and customization

Explanation: Building on the basics, this topic explores advanced plotting techniques in
Matplotlib and how to customize plots with various styles, colors, and annotations.

Sample Code:

```python
# Sample code demonstrating advanced Matplotlib techniques
import matplotlib.pyplot as plt
import numpy as np

# Create a scatter plot with customizations
x = np.random.rand(50)
y = np.random.rand(50)
colors = np.random.rand(50)
sizes = 1000 * np.random.rand(50)

plt.scatter(x, y, c=colors, s=sizes, alpha=0.5, cmap='viridis')
plt.colorbar()
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot with Customizations')
plt.show()
```

4. Introduction to Seaborn for statistical data visualization

Explanation: Seaborn is a high-level data visualization library built on top of Matplotlib. This topic introduces Seaborn and its capabilities for creating informative statistical visualizations.

Sample Code:

```python
# Sample code demonstrating Seaborn for statistical visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Load a sample dataset
tips = sns.load_dataset("tips")

# Create a box plot using Seaborn
sns.boxplot(x="day", y="total_bill", data=tips)
plt.xlabel('Day of the Week')
plt.ylabel('Total Bill Amount')
plt.title('Box Plot of Total Bill Amount by Day')
plt.show()
```

5. Creating interactive visualizations with Plotly

Explanation: Plotly is a library for creating interactive visualizations. In this topic, you'll
learn how to use Plotly to create interactive charts and plots.

Sample Code:

```python
# Sample code demonstrating Plotly for interactive visualizations
import plotly.express as px

# Create an interactive scatter plot from the built-in iris dataset
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length",
                 color="species",
                 size="petal_length",
                 hover_data=["petal_width"])
fig.update_layout(title="Interactive Scatter Plot")
fig.show()
```

These sample code snippets introduce key concepts and tools for data visualization,
including Matplotlib for static plots, Seaborn for statistical visualization, and Plotly for
interactive visualizations. These skills are essential for conveying data insights
effectively in data analysis.

Module 5: Statistical Analysis with SciPy and Statsmodels

Let's explain each of the topics in Module 5: Statistical Analysis with SciPy and
Statsmodels and provide sample Python code for each topic.

1. Introduction to statistical analysis concepts

Explanation: This topic introduces fundamental statistical concepts, such as populations, samples, parameters, and statistics. Understanding these concepts is crucial for conducting meaningful statistical analysis.

Sample Code:

```python
# Sample code illustrating statistical concepts
import numpy as np

# Generate a random sample from a standard normal distribution
np.random.seed(42)
sample_data = np.random.normal(0, 1, 100)

# Calculate sample statistics
sample_mean = np.mean(sample_data)
sample_std = np.std(sample_data)

print("Sample Mean:", sample_mean)
print("Sample Standard Deviation:", sample_std)
```

2. Hypothesis testing (t-tests, ANOVA, Chi-Square)

Explanation: This topic delves into hypothesis testing, including t-tests for comparing
means of two groups, analysis of variance (ANOVA) for comparing means of multiple
groups, and chi-squared tests for analyzing categorical data.

Sample Code:

```python
# Sample code for hypothesis testing
import numpy as np
from scipy import stats

# Generate two samples for a t-test
np.random.seed(42)
sample1 = np.random.normal(0, 1, 50)
sample2 = np.random.normal(1, 1, 50)

# Perform a two-sample t-test
t_stat, p_value = stats.ttest_ind(sample1, sample2)

print("T-statistic:", t_stat)
print("P-value:", p_value)
```

3. Regression analysis (linear and logistic regression)

Explanation: Regression analysis is a powerful tool for modeling relationships between variables. This topic covers both linear regression for predicting continuous outcomes and logistic regression for binary classification.

Sample Code:

```python
# Sample code for linear regression
import numpy as np
import statsmodels.api as sm

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 1)
y = 2 * X + 1 + np.random.randn(100, 1)

# Fit a linear regression model
X = sm.add_constant(X)  # Add a constant term (intercept)
model = sm.OLS(y, X).fit()
predictions = model.predict(X)

print(model.summary())
```

```python
# Sample code for logistic regression
import numpy as np
import statsmodels.api as sm

# Generate synthetic data (the added noise keeps the classes from being
# perfectly separable, which would prevent Logit from converging)
np.random.seed(42)
X = np.random.rand(100, 1)
y = (X[:, 0] + 0.3 * np.random.randn(100) > 0.5).astype(int)

# Fit a logistic regression model
X = sm.add_constant(X)  # Add a constant term (intercept)
model = sm.Logit(y, X).fit()
predictions = model.predict(X)

print(model.summary())
```

4. Time series analysis and forecasting

Explanation: Time series analysis focuses on understanding and forecasting data points
collected over time. This topic introduces concepts like seasonality, trends, and
forecasting techniques.

Sample Code:

```python
# Sample code for time series analysis and forecasting
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Generate synthetic time series data
np.random.seed(42)
dates = pd.date_range(start="2020-01-01", periods=100, freq='D')
values = np.random.randn(100)
ts_data = pd.Series(values, index=dates)

# Visualize the time series
ts_data.plot(figsize=(12, 6))
plt.title("Synthetic Time Series Data")
plt.xlabel("Date")
plt.ylabel("Value")
plt.show()

# Perform time series decomposition
decomposition = sm.tsa.seasonal_decompose(ts_data, model='additive')
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

# Visualize decomposed components
plt.figure(figsize=(12, 8))
plt.subplot(411)
plt.plot(ts_data, label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal, label='Seasonal')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(residual, label='Residual')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
```

5. Interpretation of statistical results

Explanation: Interpreting statistical results is a critical skill. This topic covers how to
analyze and draw meaningful conclusions from the results of hypothesis tests,
regression analyses, and time series forecasts.

Sample Code:

Interpretation of results is context-specific and depends on the analysis performed in earlier topics. It may involve explaining the significance of p-values, the meaning of coefficients in regression models, or forecasting accuracy metrics in time series analysis. A brief sketch follows.
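
As one illustration, a hedged sketch of how the two-sample t-test from earlier in this module might be read off in code (the 0.05 threshold is a common convention, not a universal rule):

```python
# Minimal sketch: interpreting a two-sample t-test result
# (reuses the samples from the hypothesis testing topic above)
import numpy as np
from scipy import stats

np.random.seed(42)
sample1 = np.random.normal(0, 1, 50)
sample2 = np.random.normal(1, 1, 50)

t_stat, p_value = stats.ttest_ind(sample1, sample2)

alpha = 0.05  # conventional significance level; context may call for another
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis;")
    print("the difference between the group means is statistically significant.")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis;")
    print("the data do not show a significant difference between the means.")
```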

Module 6: Machine Learning Fundamentals with Scikit-Learn

Let's explain each of the topics in Module 6: Machine Learning Fundamentals with Scikit-Learn and provide sample Python code for each topic.

1. Introduction to machine learning and its applications

Explanation: This topic provides an overview of machine learning and its real-world
applications. Students will understand the role of machine learning in data analysis and
decision-making.

Sample Code:

```python
# Sample code illustrating machine learning applications
import numpy as np
import matplotlib.pyplot as plt

# Generate a synthetic dataset for illustration
np.random.seed(42)
X = np.random.rand(100, 2)
y = (X[:, 0] + 2 * X[:, 1] > 1).astype(int)

# Visualize the data, colored by class label
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Sample Synthetic Dataset Scatter Plot')
plt.show()
```

2. Supervised and unsupervised learning

Explanation: This topic differentiates between supervised and unsupervised learning. Supervised learning involves predicting labels or values based on labeled training data, while unsupervised learning deals with discovering patterns in unlabeled data.

Sample Code:

```python
# Sample code illustrating supervised and unsupervised learning
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Generate synthetic data for classification (supervised)
X, y = make_classification(n_samples=100, n_features=2,
                           n_informative=2, n_redundant=0,
                           random_state=42)

# Supervised learning: classification with K-Nearest Neighbors (KNN)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Unsupervised learning: K-Means clustering (class labels are not used)
kmeans = KMeans(n_clusters=2, n_init=10)
kmeans.fit(X)
```

3. Data preprocessing and feature engineering

Explanation: Data preprocessing is a critical step in machine learning. This topic covers
techniques for cleaning and preparing data, handling missing values, and feature
engineering to create meaningful input features for models.

Sample Code:

```python
# Sample code illustrating data preprocessing and feature engineering
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Generate a synthetic dataset for illustration
data = pd.DataFrame({'age': [25, 30, 35, 40],
                     'gender': ['Male', 'Female', 'Male', 'Female'],
                     'income': [50000, 60000, None, 75000]})

# Define transformers for numerical and categorical features
numerical_features = ['age', 'income']
categorical_features = ['gender']

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # fill the missing income value
    ('scaler', StandardScaler())])                # StandardScaler cannot handle NaN

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)])

# Use the preprocessor to transform the data
X = preprocessor.fit_transform(data)
```

4. Classification and regression algorithms (Decision Trees, Random Forest, K-Nearest Neighbors, etc.)

Explanation: This topic introduces common machine learning algorithms for both classification and regression tasks. It covers Decision Trees, Random Forest, K-Nearest Neighbors, and others.

Sample Code:

```python
# Sample code illustrating classification and regression algorithms
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Generate synthetic data for regression
X, y = make_regression(n_samples=100, n_features=1, noise=0.1,
                       random_state=42)

# Regression: Decision Tree Regression
dt_regressor = DecisionTreeRegressor(max_depth=3)
dt_regressor.fit(X, y)
```

5. Model evaluation and selection

Explanation: Evaluating and selecting the right model is crucial. This topic introduces metrics for assessing model performance, including accuracy, precision, recall, F1-score, and mean squared error (MSE).

Sample Code:

```python
# Sample code illustrating model evaluation and selection
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, mean_squared_error
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

# Evaluate a classification model on held-out test data
X, y = make_classification(n_samples=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

# Evaluate a regression model on held-out test data
X, y = make_regression(n_samples=200, n_features=1, noise=0.1,
                       random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
reg = LinearRegression()
reg.fit(X_train, y_train)
mse = mean_squared_error(y_test, reg.predict(X_test))
print("Mean Squared Error:", mse)
```

Module 7: Data Wrangling and Advanced Topics

Let's explain each of the topics in Module 7: Data Wrangling and Advanced Topics and
provide sample Python code for each topic.

1. Advanced data cleaning and transformation techniques

Explanation: This topic covers advanced data cleaning techniques such as handling
outliers, dealing with missing data, and transforming variables to achieve better data
quality.

Sample Code:

```python
# Sample code illustrating advanced data cleaning and transformation
import pandas as pd
import numpy as np

# Generate a synthetic dataset with outliers and missing values
np.random.seed(42)
data = pd.DataFrame({'A': np.random.randn(100),
                     'B': np.random.randint(1, 100, size=100),
                     'C': np.random.choice([1, 2, np.nan], size=100)})

# Handling outliers (e.g., capping values above 2, a simple form of Winsorization)
data['A'] = np.where(data['A'] > 2, 2, data['A'])

# Dealing with missing data (e.g., mean imputation)
data['C'] = data['C'].fillna(data['C'].mean())

# Transforming variables (e.g., log transformation)
data['B'] = np.log(data['B'])
```

2. Feature selection and engineering strategies

Explanation: This topic explores strategies for selecting relevant features and
engineering new features to improve model performance.

Sample Code:

```python
# Sample code illustrating feature selection and engineering
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import PolynomialFeatures

# Load a dataset ('data.csv' is a placeholder with numeric
# feature columns and a 'target' column)
data = pd.read_csv('data.csv')

# Feature selection using the ANOVA F-statistic
X = data.drop(columns=['target'])
y = data['target']
selector = SelectKBest(score_func=f_classif, k=3)
X_new = selector.fit_transform(X, y)

# Feature engineering: polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
```

3. Handling categorical data and imbalanced datasets

Explanation: This topic covers techniques for handling categorical data, such as one-hot
encoding and label encoding, and strategies for dealing with imbalanced datasets.

Sample Code:

```python
# Sample code illustrating handling categorical data and
# imbalanced datasets
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from imblearn.over_sampling import RandomOverSampler

# Load a dataset with categorical features and class imbalance
# ('data.csv' is a placeholder)
data = pd.read_csv('data.csv')

# Handling categorical data: one-hot encoding
encoder = OneHotEncoder()
encoded_features = encoder.fit_transform(data[['categorical_column']])

# Handling imbalanced datasets: random oversampling of the minority class
X = data.drop(columns=['target'])
y = data['target']
oversampler = RandomOverSampler()
X_resampled, y_resampled = oversampler.fit_resample(X, y)
```

4. Dimensionality reduction techniques (PCA, t-SNE)

Explanation: This topic introduces dimensionality reduction techniques like Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce the number of features while preserving data structure.

Sample Code:

```python
# Sample code illustrating dimensionality reduction
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Generate synthetic high-dimensional data
np.random.seed(42)
X = np.random.rand(100, 50)

# Dimensionality reduction using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Dimensionality reduction using t-SNE
tsne = TSNE(n_components=2, perplexity=30, n_iter=300)
X_tsne = tsne.fit_transform(X)
```

5. Introduction to natural language processing (NLP) for text analysis

Explanation: This topic introduces the basics of Natural Language Processing (NLP) for
text analysis, including text preprocessing, tokenization, and simple text classification.

Sample Code:

```python
# Sample code illustrating NLP for text analysis
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Sample text data
text_data = [
    "This is a positive review.",
    "I didn't like this movie.",
    "Great product! Highly recommended."
]

# Tokenization (the 'punkt' tokenizer models must be downloaded once)
nltk.download('punkt')
tokenized_text = [word_tokenize(text) for text in text_data]

# Text vectorization: convert documents to token-count vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([' '.join(tokens) for tokens in tokenized_text])

# Text classification using Naive Bayes
y = [1, 0, 1]  # Labels (1 for positive, 0 for negative)
classifier = MultinomialNB()
classifier.fit(X, y)
```
