# ELT Using Pandas (Cheat Sheet)
1. Data Extraction
● Read CSV File: pd.read_csv('file.csv')
● Read Excel File: pd.read_excel('file.xlsx')
● Read JSON File: pd.read_json('file.json')
● Read SQL Database: pd.read_sql(query, connection)
● Read HTML Table: pd.read_html('page.html')
● Read Parquet File: pd.read_parquet('file.parquet')
● Read from Clipboard: pd.read_clipboard()
● Read from a Python Dictionary: pd.DataFrame.from_dict(dict)
● Read from Multiple Files: [pd.read_csv(f) for f in file_list]
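As a runnable sketch of the extraction calls above, with an in-memory CSV standing in for a real file path:

```python
import io
import pandas as pd

# An in-memory CSV stands in for a real file path.
csv_text = "id,name,score\n1,alice,90\n2,bob,85\n"
df = pd.read_csv(io.StringIO(csv_text))

# The same frame built from a plain Python dictionary.
df_dict = pd.DataFrame.from_dict(
    {"id": [1, 2], "name": ["alice", "bob"], "score": [90, 85]}
)
```

Both constructions yield the same two-row frame with inferred integer and string dtypes.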
2. Data Loading
● Write to CSV File: df.to_csv('file.csv')
● Write to Excel File: df.to_excel('file.xlsx')
● Write to JSON File: df.to_json('file.json')
● Write to SQL Database: df.to_sql(table_name, connection)
● Write to Parquet File: df.to_parquet('file.parquet')
● Write to HTML File: df.to_html('file.html')
● Append to Existing File or Database: df.to_sql(table_name, connection, if_exists='append')
● Save to Python Pickle Format: df.to_pickle('file.pkl')
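A minimal round-trip sketch of the loading calls, using a temporary directory so nothing is left on disk:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.csv")
    df.to_csv(path, index=False)   # index=False keeps the row index out of the file
    round_trip = pd.read_csv(path)
```

Writing with `index=False` is what makes the read-back frame match the original exactly.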
3. Data Transformation
● Filtering Rows: df[df['column'] > value]
● Selecting Columns: df[['col1', 'col2']]
● Renaming Columns: df.rename(columns={'old_name': 'new_name'})
● Dropping Columns: df.drop(columns=['col1', 'col2'])
● Handling Missing Data: df.fillna(value) or df.dropna()
● Type Conversion: df.astype({'col': 'int32'})
● String Operations: df['col'].str.lower()
● Datetime Conversion: pd.to_datetime(df['col'])
● Sorting Data: df.sort_values(by='col')
● Grouping and Aggregation: df.groupby('col').sum()
● Pivot Tables: df.pivot_table(index='col1', values='col2', aggfunc='mean')
By: Waleed Mousa
● Merging DataFrames: pd.merge(df1, df2, on='col')
● Concatenating DataFrames: pd.concat([df1, df2])
● Joining DataFrames: df1.join(df2, on='col')
● Reshaping with Melt: pd.melt(df, id_vars=['col1'], value_vars=['col2'])
● Reshaping with Stack/Unstack: df.stack() or df.unstack()
● Creating Dummy Variables: pd.get_dummies(df['col'])
● Applying Functions: df['col'].apply(lambda x: custom_function(x))
● Regular Expressions: df['col'].str.extract(r'(regex_pattern)')
● Handling Time Series Data: df.resample('D').mean()
● Rolling Window Calculations: df['col'].rolling(window=5).mean()
● Conditional Logic: np.where(df['col'] > value, 'yes', 'no')
● Data Normalization: (df - df.mean()) / df.std()
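A small sketch tying several of the transformation bullets together (conditional logic with `np.where`, then grouping and aggregation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"dept": ["a", "a", "b"], "sales": [10, 20, 5]})
df["flag"] = np.where(df["sales"] > 8, "yes", "no")   # conditional logic
totals = df.groupby("dept")["sales"].sum()            # grouping and aggregation
```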
4. Advanced Data Transformation
● Binning Numerical Data: pd.cut(df['col'], bins)
● Discretizing Numerical Data: pd.qcut(df['col'], q=4)
● Transforming with Map: df['col'].map(mapping_dict)
● Exploding List-Like Data: df.explode('list_col')
● Pivot Longer and Wider: df.pivot_longer() and df.pivot_wider() (using the pyjanitor library)
● Multi-Index Creation and Slicing: df.set_index(['col1', 'col2'])
● Cross-Tabulation: pd.crosstab(df['col1'], df['col2'])
● Aggregation with Custom Functions: df.groupby('col').agg(custom_agg_func)
● Correlation Matrix: df.corr()
● Data Standardization for Machine Learning: StandardScaler().fit_transform(df) (using scikit-learn)
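To illustrate the difference between `pd.cut` (fixed bin edges) and `pd.qcut` (equal-frequency bins):

```python
import pandas as pd

ages = pd.Series([5, 15, 25, 35, 45])
# pd.cut: bins defined by explicit edges, here 0-18 and 18-65.
groups = pd.cut(ages, bins=[0, 18, 65], labels=["minor", "adult"])

# pd.qcut: bins chosen so each holds roughly the same number of rows.
quartiles = pd.qcut(pd.Series(range(8)), q=4, labels=False)
```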
5. Data Cleaning
● Trimming Whitespace: df['col'].str.strip()
● Replacing Values: df.replace({'old_value': 'new_value'})
● Dropping Duplicates: df.drop_duplicates()
● Data Validation Checks: pd.testing.assert_frame_equal(df1, df2)
● Regular Interval Resampling for Time Series: df.resample('5min').mean()
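Chaining the cleaning steps above (whitespace trimming, then de-duplication) in one pass:

```python
import pandas as pd

raw = pd.Series([" alice ", "bob", "bob", " alice "])
# Strip whitespace first so " alice " and "alice" collapse to one value.
cleaned = raw.str.strip().drop_duplicates()
```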
6. Exploratory Data Analysis
● Descriptive Statistics: df.describe()
● Histograms for Distribution: df['col'].hist(bins=20)
● Box Plots for Outliers: df.boxplot(column='col')
● Pair Plots for Relationships: sns.pairplot(df) (using seaborn)
● Heatmap for Correlation Analysis: sns.heatmap(df.corr(), annot=True) (using seaborn)
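The non-plotting EDA calls can be sketched directly (the plot calls additionally require matplotlib/seaborn):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]})
stats = df.describe()   # count, mean, std, min, quartiles, max per column
corr = df.corr()        # pairwise Pearson correlation matrix
```

Since `y` is exactly `2 * x`, the off-diagonal correlation is 1.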
7. Handling Large Data
● Chunking Large Data Files: pd.read_csv('large_file.csv', chunksize=10000)
● Memory Usage of DataFrame: df.memory_usage(deep=True)
● Optimizing Data Types: df.astype({'col': 'category'})
● Lazy Evaluation with Dask: dd.from_pandas(df, npartitions=n) (using dask.dataframe as dd)
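A sketch of chunked reading and of the memory savings from the category dtype (an in-memory CSV stands in for a large file):

```python
import io
import pandas as pd

# Process the file 4 rows at a time instead of loading it all at once.
csv_text = "v\n" + "\n".join(str(i) for i in range(10))
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["v"].sum()

# Category dtype shrinks repetitive string columns dramatically.
strings = pd.Series(["red", "green", "blue"] * 100)
as_cat = strings.astype("category")
```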
8. Data Anonymization
● Hashing for Anonymization: df['col'].apply(lambda x: hashlib.sha256(str(x).encode()).hexdigest())
● Randomized Data Perturbation: df['col'] + np.random.normal(0, 1, df.shape[0])
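The hashing approach in a runnable form; SHA-256 is deterministic, so the same input always maps to the same 64-character hex digest, which preserves joins while hiding the raw value:

```python
import hashlib
import pandas as pd

emails = pd.Series(["a@example.com", "b@example.com"])
hashed = emails.apply(lambda v: hashlib.sha256(v.encode()).hexdigest())
```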
9. Text Data Specific Operations
● Word Count: df['text'].str.split().str.len()
● Text Cleaning (e.g., removing punctuation): df['text'].str.replace(r'[^\w\s]', '', regex=True)
● Term Frequency: df['text'].str.split().explode().value_counts()
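The three text operations chained together: clean punctuation first, then count words and term frequencies:

```python
import pandas as pd

texts = pd.Series(["the cat sat", "the dog barked!"])
no_punct = texts.str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation
word_counts = no_punct.str.split().str.len()              # words per row
term_freq = no_punct.str.split().explode().value_counts() # corpus-wide counts
```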
10. Visualization for EDA
● Bar Plots: df['col'].value_counts().plot(kind='bar')
● Line Plots: df.plot(kind='line', x='x_col', y='y_col')
● Scatter Plots: df.plot.scatter(x='x_col', y='y_col')
● KDE Plots for Density: df['col'].plot.kde()
11. Advanced Data Loading and Transformation
● Integrating with Web APIs: pd.read_json(api_url)
● Loading Data from Remote Sources: pd.read_csv(remote_file_url)
● Complex Data Transformations: df.pipe(custom_complex_transformation)
12. Feature Engineering
● Date Part Extraction: df['date_col'].dt.year, df['date_col'].dt.month, etc.
● Lag Features for Time Series: df['feature'].shift(periods=1)
● Rolling Features for Time Series: df['feature'].rolling(window=5).mean()
● Differential Features: df['feature'].diff(periods=1)
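The feature-engineering bullets side by side on a small series, plus a date-part extraction:

```python
import pandas as pd

s = pd.Series([1, 3, 6, 10])
lag1 = s.shift(1)                   # previous value (NaN in the first slot)
delta = s.diff()                    # change from the previous value
roll2 = s.rolling(window=2).mean()  # 2-step moving average

dates = pd.to_datetime(pd.Series(["2024-01-15", "2024-06-01"]))
years = dates.dt.year
```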
13. Data Integration
● Combining Multiple Data Sources: pd.concat([df1, df2], axis=0)
● Merging Data on Keys: pd.merge(df1, df2, on='key_column')
● Creating Database Connections for Extraction/Loading: sqlalchemy.create_engine(db_string)
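A minimal sketch of key-based merging versus row-wise concatenation:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2], "a": ["x", "y"]})
right = pd.DataFrame({"key": [1, 2], "b": ["p", "q"]})
merged = pd.merge(left, right, on="key")                      # SQL-style join on a key
stacked = pd.concat([left, left], axis=0, ignore_index=True)  # stack rows vertically
```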
14. Performance Optimization
● Parallel Processing with Swifter: df['col'].swifter.apply(custom_function) (using the swifter library)
● Optimizing DataFrames with Eval/Query: df.eval('new_col = col1 + col2')
● Categorical Data Optimization: df['cat_col'] = df['cat_col'].astype('category')
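`eval` and its companion `query` take string expressions, which pandas can evaluate efficiently without intermediate temporaries:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [10, 20]})
df = df.eval("new_col = col1 + col2")  # expression-based column creation
subset = df.query("new_col > 15")      # expression-based row filtering
```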
15. Error Handling and Data Quality
● Error Handling in Data Loading: try: df = pd.read_csv('file.csv') except FileNotFoundError: handle_error()
● Data Quality Checks: assert df['column'].notnull().all()
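One way to package the try/except pattern, using a hypothetical `safe_read_csv` helper that falls back to an empty frame instead of crashing the pipeline:

```python
import pandas as pd

def safe_read_csv(path):
    # Hypothetical helper: return an empty frame if the file is missing.
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        return pd.DataFrame()

df = safe_read_csv("definitely_missing.csv")

# A simple data-quality check on a frame that does exist.
ok = pd.DataFrame({"column": [1, 2]})
quality_ok = ok["column"].notnull().all()
```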
16. Data Serialization and Compression
● Saving DataFrames in Compressed Format: df.to_csv('file.csv.gz', compression='gzip')
● Reading Compressed Data: pd.read_csv('file.csv.gz', compression='gzip')
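A gzip round-trip in a temporary directory; pandas can also infer the compression from the `.gz` suffix, so the explicit `compression=` arguments are optional here:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"a": list(range(5))})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv.gz")
    df.to_csv(path, compression="gzip", index=False)
    back = pd.read_csv(path, compression="gzip")
```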
17. Using Pandas with Other Libraries for ETL/ELT
● Converting DataFrame to Spark DataFrame: spark.createDataFrame(df)
● Using Pandas with PySpark for Distributed Processing: spark_df = spark.read.csv('file.csv')
● Integration with NumPy for Mathematical Operations: np.log(df['numeric_column'])
18. Workflow Automation and Scripting
● Automating ETL Processes: schedule.every().day.at("10:30").do(etl_job) (using the schedule library)
● Running Pandas Operations in Scripts: python etl_script.py
19. Ensuring Data Consistency
● Data Type Validation: df['column'].dtype == 'expected_dtype'
● Consistency Checks Between DataFrames: pd.testing.assert_frame_equal(df1, df2)
20. Reporting and Documentation
● Generating Summary Reports: profile = pandas_profiling.ProfileReport(df) (the pandas-profiling package is now published as ydata-profiling)
21. Database Specific Operations
● Querying Databases Directly: pd.read_sql_query('SELECT * FROM table', engine)
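A self-contained sketch of `read_sql_query` against a throwaway in-memory SQLite database (a plain DBAPI connection works here in place of a SQLAlchemy engine):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])
df = pd.read_sql_query("SELECT * FROM t", conn)
conn.close()
```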