Python For Data Science Cheat Sheet Excel Spreadsheets Pickled Files
>>> file = 'urbanpop.xlsx' >>> import pickle
Importing Data >>> data = pd.ExcelFile(file) >>> with open('pickled_fruit.pkl', 'rb') as file: pickled_data = pickle.load(file) >>> df_sheet2 = data.parse('1960-1966', Learn Python for data science Interactively at www.DataCamp.com skiprows=[0], names=['Country', 'AAM: War(2002)']) >>> df_sheet1 = data.parse(0, HDF5 Files parse_cols=[0], Importing Data in Python skiprows=[0], >>> import h5py >>> filename = 'H-H1_LOSC_4_v1-815411200-4096.hdf5' names=['Country']) Most of the time, you’ll use either NumPy or pandas to import >>> data = h5py.File(filename, 'r') your data: To access the sheet names, use the sheet_names attribute: >>> import numpy as np >>> data.sheet_names >>> import pandas as pd Matlab Files Help SAS Files >>> import scipy.io >>> filename = 'workspace.mat' >>> from sas7bdat import SAS7BDAT >>> mat = scipy.io.loadmat(filename) >>> np.info(np.ndarray.dtype) >>> help(pd.read_csv) >>> with SAS7BDAT('urbanpop.sas7bdat') as file: df_sas = file.to_data_frame()
Text Files Exploring Dictionaries
Stata Files Accessing Elements with Functions Plain Text Files >>> data = pd.read_stata('urbanpop.dta') >>> print(mat.keys()) Print dictionary keys >>> filename = 'huck_finn.txt' >>> for key in data.keys(): Print dictionary keys >>> file = open(filename, mode='r') Open the file for reading print(key) >>> text = file.read() Read a file’s contents Relational Databases meta quality >>> print(file.closed) Check whether file is closed >>> from sqlalchemy import create_engine strain >>> file.close() Close file >>> print(text) >>> engine = create_engine('sqlite://Northwind.sqlite') >>> pickled_data.values() Return dictionary values >>> print(mat.items()) Returns items in list format of (key, value) Use the table_names() method to fetch a list of table names: tuple pairs Using the context manager with >>> with open('huck_finn.txt', 'r') as file: >>> table_names = engine.table_names() Accessing Data Items with Keys print(file.readline()) Read a single line print(file.readline()) Querying Relational Databases >>> for key in data ['meta'].keys() Explore the HDF5 structure print(file.readline()) print(key) >>> con = engine.connect() Description >>> rs = con.execute("SELECT * FROM Orders") DescriptionURL Table Data: Flat Files >>> df = pd.DataFrame(rs.fetchall()) Detector >>> df.columns = rs.keys() Duration GPSstart Importing Flat Files with numpy >>> con.close() Observatory Files with one data type Using the context manager with Type UTCstart >>> filename = ‘mnist.txt’ >>> with engine.connect() as con: >>> print(data['meta']['Description'].value) Retrieve the value for a key >>> data = np.loadtxt(filename, rs = con.execute("SELECT OrderID FROM Orders") delimiter=',', String used to separate values df = pd.DataFrame(rs.fetchmany(size=5)) df.columns = rs.keys() skiprows=2, usecols=[0,2], Skip the first 2 lines Read the 1st and 3rd column Navigating Your FileSystem dtype=str) The type of the resulting array Querying relational databases with pandas Magic Commands Files with mixed data types >>> df = pd.read_sql_query("SELECT * FROM Orders", engine) >>> filename = 'titanic.csv' !ls List directory contents of files and directories >>> data = np.genfromtxt(filename, %cd .. Change current working directory %pwd Return the current working directory path delimiter=',', names=True, Look for column header Exploring Your Data dtype=None) NumPy Arrays os Library >>> data_array = np.recfromcsv(filename) >>> data_array.dtype Data type of array elements >>> import os >>> data_array.shape Array dimensions >>> path = "/usr/tmp" The default dtype of the np.recfromcsv() function is None. >>> wd = os.getcwd() Store the name of current directory in a string >>> len(data_array) Length of array >>> os.listdir(wd) Output contents of the directory in a list Importing Flat Files with pandas >>> os.chdir(path) Change current working directory pandas DataFrames >>> os.rename("test1.txt", Rename a file >>> filename = 'winequality-red.csv' "test2.txt") >>> data = pd.read_csv(filename, >>> df.head() Return first DataFrame rows nrows=5, >>> os.remove("test1.txt") Delete an existing file Number of rows of file to read >>> df.tail() Return last DataFrame rows >>> os.mkdir("newdir") Create a new directory header=None, Row number to use as col names >>> df.index Describe index sep='\t', Delimiter to use >>> df.columns Describe DataFrame columns comment='#', Character to split comments >>> df.info() Info on DataFrame na_values=[""]) String to recognize as NA/NaN >>> data_array = data.values Convert a DataFrame to an a NumPy array DataCamp Learn R for Data Science Interactively