Professional Documents
Culture Documents
In [2]:
import os
os.getcwd()
In [3]:
from IPython.display import Image
Image(filename='..\imagenes\\pandas.jpg', width=750, height=750)
Out[3]:
Series
A Pandas Series is a one-dimensional array of indexed data.
A series can be created using various inputs like: Array, Dict, Scalar value or constant
In [4]:
import pandas as pd
print(pd.__version__)
1.3.4
In [5]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
type(data)
0 0.25
Out[5]:
1 0.50
2 0.75
3 1.00
dtype: float64
pandas.core.series.Series
Out[5]:
In [6]:
data.index #es una forma de acceder a la propiedad index del objeto Pandas Series data.
#La propiedad index es un objeto que almacena las etiquetas o nombres de
#las filas en un objeto Pandas Series. Obtendrás como resultado un objeto
#que contiene las etiquetas de las filas de la serie.
#Esto significa que la serie tiene etiquetas de fila que van desde 0 hasta 3
#(ya que hay cuatro elementos en la serie) con un incremento de 1.
#En este caso, las etiquetas son generadas
#automáticamente por Pandas porque no se especificaron etiquetas personalizadas.
data.values
type(data.values)
In [7]:
range(0, 4, 1)
list(range(0, 4, 1))
range(0, 4)
Out[7]:
[0, 1, 2, 3]
Out[7]:
In [14]:
help(pd.Series)
data = pd.Series([0.25, 0.5, 0.75, 1.0], index = ['a', 'b', 'c', 'd'])
data
a 0.25
Out[8]:
b 0.50
c 0.75
d 1.00
dtype: float64
In [11]:
population = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860, 'Illinois': 12882135})
population
California 38332521
Out[11]:
Texas 26448193
New York 19651127
Florida 19552860
Illinois 12882135
dtype: int64
DataFrame
is an analog of a two-dimensional array with both flexible row indices and flexible column names.
A pandas DataFrame can be created using various inputs like: Lists, dict, Series, Numpy ndarrays, Another
DataFrame
In [10]:
data = [1,2,3,4,5]
data ; type(data)
df = pd.DataFrame(data)
df ; type(df)
[1, 2, 3, 4, 5]
Out[10]:
list
Out[10]:
Out[10]: 0
0 1
1 2
2 3
3 4
4 5
pandas.core.frame.DataFrame
Out[10]:
In [12]:
df.index
df.columns
RangeIndex(start=0, stop=5, step=1)
Out[12]:
RangeIndex(start=0, stop=1, step=1)
Out[12]:
In [13]:
df.ndim
df.shape
df.size
2
Out[13]:
(5, 1)
Out[13]:
5
Out[13]:
In [20]:
help(pd.DataFrame)
In [14]:
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'], 'Age':[28,34,29,42]}
#Es una forma de crear un diccionario de Python llamado data que contiene dos listas: "Name" y "Ag
#Cada lista representa una columna de datos.
data ; type(data)
df = pd.DataFrame(data)
df
{'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
Out[14]:
dict
Out[14]:
0 Tom 28
1 Jack 34
2 Steve 29
3 Ricky 42
In [15]:
# cambiar nombre de indices de filas
rank1 Tom 28
rank2 Jack 34
rank3 Steve 29
rank4 Ricky 42
In [16]:
df.index
df.columns
In [17]:
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
#En este código se crea una lista de diccionarios en Python.
#Cada elemento de la lista es un diccionario que representa una fila de datos,
#donde las claves (keys) del diccionario son los nombres de las columnas y
#los valores son los valores de cada columna para esa fila.
data ; type(data)
df = pd.DataFrame(data)
df
[{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
Out[17]:
list
Out[17]:
Out[17]: a b c
0 1 2 NaN
1 5 10 20.0
In [18]:
df.ndim
df.shape
df.size
2
Out[18]:
(2, 3)
Out[18]:
6
Out[18]:
In [19]:
import numpy as np
np.random.seed(2021)
x = np.random.standard_normal((3,2))
x
In [20]:
df = pd.DataFrame(x)
df
Out[20]: 0 1
0 1.488609 0.676011
1 -0.418451 -0.806521
2 0.555876 -0.705504
In [25]:
os.chdir('C:\\Users\\marth\\OneDrive\\Documentos\\Documentos Trabajo 04 Ago 21\\Clases Febrero Jun
os.getcwd()
.csv
In [26]:
dep2016 = pd.read_csv('Depositos_2016.csv')
dep2016.head()
In [27]:
dep2016.tail()
.txt
In [28]:
help(pd.read_csv)
Parameters
----------
filepath_or_buffer : str, path object or file-like object
Any valid string path is acceptable. The string could be a URL. Valid
URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is
expected. A local file could be: file://localhost/path/to/table.csv.
Note: ``index_col=False`` can be used to force pandas to *not* use the first
column as the index, e.g. when you have a malformed file with delimiters at
the end of each line.
usecols : list-like or callable, optional
Return a subset of the columns. If list-like, all elements must either
be positional (i.e. integer indices into the document columns) or strings
that correspond to column names provided either by the user in `names` or
inferred from the document header row(s). For example, a valid list-like
`usecols` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``.
Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
To instantiate a DataFrame from ``data`` with element order preserved use
``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns
in ``['foo', 'bar']`` order or
``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]``
for ``['bar', 'foo']`` order.
.. versionadded:: 0.25.0
iterator : bool, default False
Return TextFileReader object for iteration or getting chunks with
``get_chunk()``.
.. versionchanged:: 1.2
.. versionchanged:: 1.2
.. versionchanged:: 1.2
.. versionchanged:: 1.3.0
.. versionadded:: 1.3.0
.. deprecated:: 1.3.0
The ``on_bad_lines`` parameter should be used instead to specify behavior upon
encountering a bad line instead.
warn_bad_lines : bool, default ``None``
If error_bad_lines is False, and warn_bad_lines is True, a warning for each
"bad line" will be output.
.. deprecated:: 1.3.0
The ``on_bad_lines`` parameter should be used instead to specify behavior upon
encountering a bad line instead.
on_bad_lines : {'error', 'warn', 'skip'}, default 'error'
Specifies what to do upon encountering a bad line (a line with too many fields).
Allowed values are :
.. versionadded:: 1.3.0
.. versionchanged:: 1.2
.. versionadded:: 1.2
Returns
-------
DataFrame or TextParser
A comma-separated values (csv) file is returned as two-dimensional
data structure with labeled axes.
See Also
--------
DataFrame.to_csv : Write DataFrame to a comma-separated values (csv) file.
read_csv : Read a comma-separated values (csv) file into DataFrame.
read_fwf : Read a table of fixed-width formatted lines into DataFrame.
Examples
--------
>>> pd.read_csv('data.csv') # doctest: +SKIP
In [29]:
df_txt = pd.read_csv("housing.data.txt", header = None, sep = "\s+")
df_txt.head()
Out[29]: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
In [12]:
campos = ['CRIM', 'ZN', 'INDUS', 'CHAS','NOX','RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'MEDV']
In [31]:
df_txt.columns
df_txt.columns = campos
In [32]:
df_txt.columns
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
Out[32]:
'PTRATIO', 'B', 'LSTAT', 'MEDV'],
dtype='object')
In [33]:
df_txt.head()
Out[33]: CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
Excel
In [34]:
# importar archivo .xlsx
clientes = pd.read_excel('Clientes_2016_2017.xlsx')
clientes.head()
SPSS
In [35]:
# !pip install pyreadstat
!pip install pyreadstat
# se puede prescindir este codigo:
# import pyreadstat
In [37]:
ed = pd.read_spss('Employee data.sav')
ed.head()
Out[37]: id sexo fechnac educ catlab salario salini tiempemp expprev minoría
Bases de Datos
In [ ]:
### SQL Server
* Server: 69.164.182.232
* Usuario: XXX_read
* Contraseña: #XXX
* Database: alumno
* Esquema: student
* tabla: anexo7
In [38]:
import pyodbc # conexión a SQL Server por odbc
In [54]:
help(pyodbc.connect)
connect(...)
connect(str, autocommit=False, ansi=False, timeout=0, **kwargs) --> Connection
cnxn = pyodbc.connect('DSN=DataSourceName;UID=user;PWD=password')
DRIVER={SQL Server};SERVER=localhost;DATABASE=testdb;UID=user;PWD=password
Note the use of braces when a value contains spaces. Refer to SQLDriverConnect
documentation or the documentation of your ODBC driver for details.
The connection string can be passed as the string `str`, as a list of keywords,
or a combination of the two. Any keywords except autocommit, ansi, and timeout
(see below) are simply added to the connection string.
connect('server=localhost;user=me')
connect(server='localhost', user='me')
connect('server=localhost', user='me')
The DB API recommends the keywords 'user', 'password', and 'host', but these
are not valid ODBC keywords, so these will be converted to 'uid', 'pwd', and
'server'.
Special Keywords
The following specal keywords are processed by pyodbc and are not added to the
connection string. (If you must use these in your connection string, pass them
as a string, not as keywords.)
autocommit
If False or zero, the default, transactions are created automatically as
defined in the DB API 2. If True or non-zero, the connection is put into
ODBC autocommit mode and statements are committed automatically.
ansi
By default, pyodbc first attempts to connect using the Unicode version of
SQLDriverConnectW. If the driver returns IM001 indicating it does not
support the Unicode version, the ANSI version is tried. Any other SQLSTATE
is turned into an exception. Setting ansi to true skips the Unicode
attempt and only connects using the ANSI version. This is useful for
drivers that return the wrong SQLSTATE (or if pyodbc is out of date and
should support other SQLSTATEs).
timeout
An integer login timeout in seconds, used to set the SQL_ATTR_LOGIN_TIMEOUT
attribute of the connection. The default is 0 which means the database's
default timeout, if any, is used.
In [ ]:
# No tenemos acceso es solo un ejemplo NO EJECUTAR
conexion = pyodbc.connect("DRIVER={SQL Server}; SERVER=69.164.192.245; DATABASE=alumno; UID=sdc_re
In [ ]:
cur = conexion.cursor()
In [ ]:
cur.execute("SELECT * FROM student.anexo6")
In [ ]:
#fetchall(): función de python q permite consultar todas las filas de la tabla
rows = cur.fetchall()
In [ ]:
# mostrar filas
rows[:2]
In [65]:
# lista con los nombres de las columnas, ESTA INFO NOS DEBE DAR EL DBA
et_columns = ["ncl","fnaci","genero","est_civil","csocio","part_reg","tid","nid",
"tpers","domici","rel_lab","cal","calint","cage",
"mon","ccr","tcr","tspr","fot","morg","tea","skcr","cuenta","kvi",
"kre","krf","kve","kju","kco","dak",
"gar_pref","gar_auto","prv_req","prv_cons","kcas","sin","sis","sid",
"tpr","ncpr","ncpa","pcuo","pgra","fven_org","fven_act"]
In [66]:
import numpy as np
In [70]:
np.array(rows[:2])
In [67]:
df_sql_server = pd.DataFrame(np.array(rows), columns = et_columns)
In [68]:
df_sql_server.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16185 entries, 0 to 16184
Data columns (total 45 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ncl 16185 non-null object
1 fnaci 16076 non-null object
2 genero 16185 non-null object
3 est_civil 16185 non-null object
4 csocio 16185 non-null object
5 part_reg 0 non-null object
6 tid 16185 non-null object
7 nid 16183 non-null object
8 tpers 16185 non-null object
9 domici 16185 non-null object
10 rel_lab 16185 non-null object
11 cal 16185 non-null object
12 calint 16185 non-null object
13 cage 16185 non-null object
14 mon 16185 non-null object
15 ccr 16165 non-null object
16 tcr 16185 non-null object
17 tspr 0 non-null object
18 fot 16185 non-null object
19 morg 16176 non-null object
20 tea 16118 non-null object
21 skcr 16026 non-null object
22 cuenta 16185 non-null object
23 kvi 15466 non-null object
24 kre 0 non-null object
25 krf 0 non-null object
26 kve 0 non-null object
27 kju 0 non-null object
28 kco 0 non-null object
29 dak 6877 non-null object
30 gar_pref 0 non-null object
31 gar_auto 0 non-null object
32 prv_req 16063 non-null object
33 prv_cons 16063 non-null object
34 kcas 0 non-null object
35 sin 10953 non-null object
36 sis 3685 non-null object
37 sid 0 non-null object
38 tpr 16185 non-null object
39 ncpr 15831 non-null object
40 ncpa 14130 non-null object
41 pcuo 15984 non-null object
42 pgra 8046 non-null object
43 fven_org 16185 non-null object
44 fven_act 16185 non-null object
dtypes: object(45)
memory usage: 5.6+ MB
In [69]:
df_sql_server.head() #muesta los primeros 5 registros
Out[69]: ncl fnaci genero est_civil csocio part_reg tid nid tpers domici ... sin s
Calle MZ. F
LOTE 14
0 abc1 30914.0 F S AP1255172XTH None DNI 268551242.0 1.0 PARQUE ... None 420.
RESIDENCIAL
MONTEVE...
Calle
NICARAGUA
1 abc2 33763.0 M S AP1000769XTH None DNI 705555647.0 1.0 Mz. H1 Lt. 3 ... 6.84 No
Referencia AH
L...
Calle CA. J F
38 URB.
2 abc3 25944.0 M S AP1255369XTH None DNI 46228110.0 1.0 PACHACAMAC ... 167.3 No
BARRIO 1
SECT...
Calle CA
HERMANOS
3 abc4 24602.0 M S AP1254225XTH None DNI 252009522.0 1.0 AYAR 405 C2- ... 3799.7 No
10 URB LOS
CHANC...
Calle
SALAVERRY
4 abc5 34059.0 M S AP1254240XTH None DNI 872775748.0 1.0 375 CASA A ... None No
Nro..
Referencia...
5 rows × 45 columns
In [71]:
df_sql_server.shape
(16185, 45)
Out[71]:
In [72]:
df_sql_server.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16185 entries, 0 to 16184
Data columns (total 45 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ncl 16185 non-null object
1 fnaci 16076 non-null object
2 genero 16185 non-null object
3 est_civil 16185 non-null object
4 csocio 16185 non-null object
5 part_reg 0 non-null object
6 tid 16185 non-null object
7 nid 16183 non-null object
8 tpers 16185 non-null object
9 domici 16185 non-null object
10 rel_lab 16185 non-null object
11 cal 16185 non-null object
12 calint 16185 non-null object
13 cage 16185 non-null object
14 mon 16185 non-null object
15 ccr 16165 non-null object
16 tcr 16185 non-null object
17 tspr 0 non-null object
18 fot 16185 non-null object
19 morg 16176 non-null object
20 tea 16118 non-null object
21 skcr 16026 non-null object
22 cuenta 16185 non-null object
23 kvi 15466 non-null object
24 kre 0 non-null object
25 krf 0 non-null object
26 kve 0 non-null object
27 kju 0 non-null object
28 kco 0 non-null object
29 dak 6877 non-null object
30 gar_pref 0 non-null object
31 gar_auto 0 non-null object
32 prv_req 16063 non-null object
33 prv_cons 16063 non-null object
34 kcas 0 non-null object
35 sin 10953 non-null object
36 sis 3685 non-null object
37 sid 0 non-null object
38 tpr 16185 non-null object
39 ncpr 15831 non-null object
40 ncpa 14130 non-null object
41 pcuo 15984 non-null object
42 pgra 8046 non-null object
43 fven_org 16185 non-null object
44 fven_act 16185 non-null object
dtypes: object(45)
memory usage: 5.6+ MB
In [ ]:
# tampoco hacer Postgre SQL
Postgre SQL
Nombre del usuario: postgres
Contraseña del usuario: sdc2019PERU
IP público: 69.164.192.245
Nombre de base de datos: profesor
Nombre de la tabla: bupa
In [73]:
#
!pip install psycopg2
Collecting psycopg2
Downloading psycopg2-2.9.3-cp39-cp39-win_amd64.whl (1.2 MB)
Installing collected packages: psycopg2
Successfully installed psycopg2-2.9.3
In [75]:
import psycopg2 #conexión a PostgreSQL
In [76]:
help(psycopg2.connect)
In [77]:
conn = psycopg2.connect("dbname=profesor user=postgres password=XXXXX host=69.164.192.000")
In [78]:
cursor = conn.cursor()
In [79]:
cursor.execute("SELECT * FROM departamento")
In [80]:
# fetchmany(): función que permite consultar determinadas filas de la tabla
data_select = cursor.fetchmany(10)
In [81]:
data_select
type(data_select)
In [82]:
data_m2 = cursor.fetchall()
In [83]:
df_postgreSQL = pd.DataFrame(data_m2, columns = ["nombres","metros2", "precio"])
In [84]:
df_postgreSQL.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 nombres 62 non-null object
1 metros2 62 non-null object
2 precio 62 non-null object
dtypes: object(3)
memory usage: 1.6+ KB
In [85]:
df_postgreSQL.head()
JSON
In [40]:
import json
list
Out[41]:
In [42]:
datos
In [43]:
datos[0]
In [44]:
type(datos[0])
dict
Out[44]:
In [45]:
df_json = pd.DataFrame(datos)
df_json.head()
In [ ]:
Ejercicios
Crear una serie de datos que inicie en 5 y termine en 10
Crear un data frame de datos aleatorios con 3 filas y 4 columnas con una semilla con valor 2023
Importar un archivo csv, otroa rchivo json, un archivo excel y un archivo de texto y convertirlos