
import pandas as pd
import numpy as np

Search for data in .txt, .tsv, .csv, .pkl, Excel (.xlsx), SAS, Stata, .pdf, and .json file formats and download it (a download sketch follows the next task).

Import data from each of the file formats listed in the first task. In a comment, explain why you chose numpy, pandas, or another library for each format.
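For the download step in the first task, a minimal sketch using only the standard library; the URL below is a placeholder, not a link used in this assignment:

# Minimal download sketch (placeholder URL; substitute the real dataset link)
import urllib.request

url = 'https://example.com/data/mnist_test.csv'  # placeholder URL
urllib.request.urlretrieve(url, 'mnist_test.csv')  # saves the file in the working directory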

# For text files like .csv, .txt, and .tsv, pandas makes import easy; a plain .txt can also be read with open():

with open('as4.txt', 'r') as file:
    txt = file.read()
print(txt)

hello 1 2 3

data = pd.read_csv('mnist_test.csv')
data.head()

label 1x1 1x2 1x3 1x4 1x5 1x6 1x7 1x8 1x9 ... 28x19 28x20 28x21 28x22 28x23 28x24 28x25 28x26 28x27 28x28

0 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

1 2 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

2 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

4 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 785 columns

# Read a .tsv file using read_csv with a tab delimiter
data = pd.read_csv('demo.tsv', delimiter='\t')
data.head()

output unit,sex,age,geo\time 2021 2020 2019 2018 2017 2016 2015 2014 2013 ... 1969 1968 1967 1966 1965 1964 1963 1962 1

0 YR,F,Y1,AL : 79.4 80.4 80.2 79.7 79.8 79.2 79.8 79.6 ... : : : : : : : :

1 YR,F,Y1,AM : : 79.1 79.2 78.5 78.0 77.9 : : ... : : : : : : : :

2 YR,F,Y1,AT : 82.9 83.5 83.3 83.2 83.4 83.0 83.3 83.0 ... : : : : : : : :

3 YR,F,Y1,AZ : : 78.6 78.1 77.7 : 77.5 77.0 76.9 ... : : : : : : : :

4 YR,F,Y1,BE : 82.3 83.6 83.2 83.2 83.2 82.6 83.1 82.5 ... 74.4 74.3 74.7 74.4 74.3 74.6 73.9 74.0 7

5 rows × 63 columns

# For Excel files, pandas is also the easy choice: read_excel
excel = pd.read_excel('Financial Sample.xlsx')
excel.head()

      Segment  Country   Product  Discount Band  Units Sold  Manufacturing Price  Sale Price  Gross Sales  ...
0  Government   Canada  Carretera          None      1618.5                    3          20      32370.0  ...
1  Government  Germany  Carretera          None      1321.0                    3          20      26420.0  ...
2   Midmarket   France  Carretera          None      2178.0                    3          15      32670.0  ...
3   Midmarket  Germany  Carretera          None       888.0                    3          15      13320.0  ...
4   Midmarket   Mexico  Carretera          None      2470.0                    3          15      37050.0  ...

# For JSON files, we use the standard json library
import json

with open('iris.json', 'r') as file:
    iris_json = json.load(file)
iris_json
[{'sepalLength': 5.1,
'sepalWidth': 3.5,
'petalLength': 1.4,
'petalWidth': 0.2,
'species': 'setosa'},
{'sepalLength': 4.9,
'sepalWidth': 3.0,
'petalLength': 1.4,
'petalWidth': 0.2,
'species': 'setosa'},
{'sepalLength': 4.7,
'sepalWidth': 3.2,
'petalLength': 1.3,
'petalWidth': 0.2,
'species': 'setosa'},
{'sepalLength': 4.6,
'sepalWidth': 3.1,
'petalLength': 1.5,
'petalWidth': 0.2,
'species': 'setosa'},
{'sepalLength': 5.0,
'sepalWidth': 3.6,
'petalLength': 1.4,
'petalWidth': 0.2,
'species': 'setosa'},
{'sepalLength': 5.4,
'sepalWidth': 3.9,
'petalLength': 1.7,
'petalWidth': 0.4,
'species': 'setosa'},
{'sepalLength': 4.6,
'sepalWidth': 3.4,
'petalLength': 1.4,
'petalWidth': 0.3,
'species': 'setosa'},
{'sepalLength': 5.0,
'sepalWidth': 3.4,
'petalLength': 1.5,
'petalWidth': 0.2,
'species': 'setosa'},
{'sepalLength': 4.4,
'sepalWidth': 2.9,
'petalLength': 1.4,
'petalWidth': 0.2,
'species': 'setosa'},
{'sepalLength': 4.9,
'sepalWidth': 3.1,
'petalLength': 1.5,
'petalWidth': 0.1,
'species': 'setosa'},
{'sepalLength': 5.4,
'sepalWidth': 3.7,
'petalLength': 1.5,
'petalWidth': 0.2,
'species': 'setosa'},
{'sepalLength': 4.8,
'sepalWidth': 3.4,
'petalLength': 1.6,

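As a short alternative sketch (using the same iris.json file as above), pandas can also load this kind of flat, record-oriented JSON directly into a DataFrame:

# Sketch: pandas reads a flat JSON array of records straight into a DataFrame
iris_df = pd.read_json('iris.json')
iris_df.head()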
pip install sas7bdat

Collecting sas7bdat
Downloading sas7bdat-2.2.3.tar.gz (16 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: six>=1.8.0 in /usr/local/lib/python3.10/dist-packages (from sas7bdat) (1.16.0)
Building wheels for collected packages: sas7bdat
Building wheel for sas7bdat (setup.py) ... done
Created wheel for sas7bdat: filename=sas7bdat-2.2.3-py3-none-any.whl size=16291 sha256=688b7602e14330f22ceeba0f77d2bc32d1712c4ad41
Stored in directory: /root/.cache/pip/wheels/d2/ad/aa/badcd17bd07e0df1adfc85e738acc942787648fb7ed4044543
Successfully built sas7bdat
Installing collected packages: sas7bdat
Successfully installed sas7bdat-2.2.3

# For SAS files, we use a specialized library like sas7bdat
from sas7bdat import SAS7BDAT

sas = 'airline.sas7bdat'
with SAS7BDAT(sas) as file:
    sas_data = file.to_data_frame()
sas_data.head()
YEAR Y W R L K

0 1948.0 1.214 0.243 0.1454 1.415 0.612

1 1949.0 1.354 0.260 0.2181 1.384 0.559

2 1950.0 1.569 0.278 0.3157 1.388 0.573

3 1951.0 1.948 0.297 0.3940 1.550 0.564

4 1952.0 2.265 0.310 0.3559 1.802 0.574
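As a sketch (same airline.sas7bdat file as above), pandas also ships its own SAS reader, so the extra package is optional for this format:

# Sketch: pandas' built-in SAS reader, as an alternative to sas7bdat
sas_df = pd.read_sas('airline.sas7bdat')
sas_df.head()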

# For Stata .dta files, pandas has a built-in reader: read_stata
data = pd.read_stata('Youtube.dta')
data.head()

      video_id                                              title      channel_title              publish_time        views      likes
0  FlsCjmMhFmw  YouTube Rewind: The Shape of 2017 | #YouTubeRe...  YouTube Spotlight  2017-12-06T17:58:51.000Z  149376127.0  3093544.0
1  FlsCjmMhFmw  YouTube Rewind: The Shape of 2017 ...                    YouTube ...               2017-12-...  137843120.0  3014471.0

# For pickle files, we can use the pickle module, pandas, or numpy:
# data = pd.read_pickle(...) or data = np.load(..., allow_pickle=True)
import pickle

with open('ukraine_war.pkl', 'rb') as file:
    pkl = pickle.load(file)
pkl

                     Datetime             Tweet Id                                                Text         Username
0   2022-05-23 23:59:09+00:00  1528888293852872704  #TikTok #russia #refugeecrisis #ukraine #famin...   Danaya_Pashneya
1   2022-05-23 23:58:52+00:00  1528888224713777154  Russian diplomat to U.N. Boris Bondarev resign...        MJoyce2625
2   2022-05-23 23:58:52+00:00  1528888224197660672  Ukrainian Presidential Office discloses how ma...     knittingknots
3   2022-05-23 23:58:52+00:00  1528888221907636224  Ukraine War: The battle for Severodonetsk http...           will385
4   2022-05-23 23:58:47+00:00  1528888200696975360  Alexander Lukashenko reminds me of an abused s...    i__heart__this
..                        ...                  ...                                                 ...              ...
    2022-05-22 ...                              ...  @akshayalladi China ...                                         ...
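A sketch of the pandas shortcut mentioned in the comment above; it applies here because the pickled object appears to be a DataFrame (as the output suggests):

# Sketch: pd.read_pickle as a one-line alternative to pickle.load
pkl_df = pd.read_pickle('ukraine_war.pkl')
pkl_df.head()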

pip install PyPDF2

Collecting PyPDF2
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 232.6/232.6 kB 4.7 MB/s eta 0:00:00
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1

# For PDF files, we use the PyPDF2 library to extract text
from PyPDF2 import PdfReader

reader = PdfReader('file.pdf')
print(len(reader.pages))
page = reader.pages[0]
text = page.extract_text()
text
4
'Lorem ipsum \nLorem ipsum dolor sit amet, consectetur adipiscing \nelit. Nunc ac faucibus
odio. \nVestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent
ut\nvarius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio
interdum\ncondimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus\nconvallis.
Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis,\nvulputate ac suscipit
et, iaculis non est. Curabitur semper arcu ac ligula semper, nec luctus\nnisl blandit.
Integer lacinia ante ac libero lobortis imp

Open each worksheet of the Excel file

# pd.ExcelFile lets us list the worksheets and parse them individually
file = 'final.xlsx'
data = pd.ExcelFile(file)
data.sheet_names

['Disclaimer',
"What's new in V3",
'Front page',
'A1. Headline series',
'TableofContent',
'A2. Pop of Eng & GB 1086-1870',
'A3. Eng. Agriculture 1270-1870',
'A4. Ind Production 1270-1870',
'A5. Service Sector 1270-1870',
'A6. English GDP(O) 1270-1700',
'A7. GB GDP(O) 1700-1870',
'Notes on GDP estimates',
'A8. UK Real GDP(A)',
'A9. Nominal GDP (A)',
'A10. GNP and National Saving',
'A11. GDP(E) components - values',
'A12. GDP(E) components - vols',
'A13. GDP(E) contributions',
'A14. Real GDP(O) components',
'A15. Factor incomes by SIC ',
'A16. Industry GVA shares by SIC',
'A17. GDP(I) components',
'A18. Population 1680+',
'A19. Migration flows',
'A20. Migration by citizenship',
'A21. GDP per capita 1086+',
'A22. Coin in circulation',
"A23. Bank of England B'Sheet",
'A24. Monetary aggregates',
'A25. Credit aggregates',
'A25a. Bills of Exchange',
'A26. Central govt 1290-1689',
'A27. Central govt borrowing ',
'A28. Public Sector Borrowing',
'A28a. Public Sector Spending',
'A29. The National Debt',
'A30a. Nat Debt mkt vals 1727',
'A30b. Nat Debt mkt vals 1900-',
'A30c. Term Annuities 1694-1836',
'A31. Interest rates & asset ps ',
'A32. Property prices & rent',
'A33. Exchange rate data',
'A34a. Trade volumes 1280-',
'A34b. Off. trade values 1697-',
'A35. Trade volumes and prices',
'A36. Trade values and BOP',
'A37. Trade breakdown 1663-1701',
'A38. Trade breakdown 1699-1774',
'A39. Trade - Goods trade 1774',
'A40. Trade by region 1710-1822',
'A41. Trade by region 1784+',
'A42. Regional Trade summary',
'A43. Trade - by Trade Area',
'A44. Trade by Country',
'A45. UK Shares of world trade',
'A46. Net external assets',
'A47. Wages and prices',
'A48. Real Earnings ',
' 9 G l t i C 8th'
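An alternative sketch for opening every worksheet at once: pd.read_excel with sheet_name=None returns a dict of DataFrames keyed by the sheet names listed above (this can be slow for a workbook of this size):

# Sketch: load every worksheet into a dict of DataFrames keyed by sheet name
sheets = pd.read_excel('final.xlsx', sheet_name=None)
list(sheets)[:5]  # first few sheet names, matching the list above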

Parse your spreadsheets and use additional arguments to skip rows

df = data.parse('A64. Sector Balance Sheets', skiprows=5)
df
    Households  Unnamed: 1  Unnamed: 2  Unnamed: 3  Unnamed: 4  Unnamed: 5  Unnamed: 6  Unnamed: 7  Unnamed: 8  Unnamed: 9  Unnamed: 10  Unnamed: 11  ...
0          NaN  Total financial assets | Currency and deposits | Securities other than shares | Loans | Shares and other equities | Insurance technical reserves | Other accounts receivable | Total financial liabilities | Currency and deposits | Securities other than shares | Loa...
1         1957          43          14           4         NaN          11          10           4         NaN           7          NaN          NaN  ...
2         1958          50          15           5         NaN          15          11           4         NaN           8          NaN          NaN  ...
3         1959          59          16           5         NaN          22          12           5         NaN           9          NaN          NaN  ...
4         1960          58          17           4         NaN          20          13           5         NaN          10          NaN          NaN  ...
..         ...         ...         ...         ...         ...         ...         ...         ...         ...         ...          ...          ...  ...
176       2002        3369        1442         500         773         639          13           2         NaN        3330         1298          587  ...
177       2003        3767        1608         572         837         737          10           2         NaN        3720         1475          612  ...
178       2004        4288        1858         672         943         803          11           2         NaN        4175         1644          685  ...
179       2005        5274        2202         832        1252         971          14           2         NaN        5130         2080          793  ...
180       2006        5789        2374         956        1301        1141          15           2         NaN        5520         2213          909  ...
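An equivalent sketch with pd.read_excel, for when the ExcelFile object is not kept around; sheet_name and skiprows mirror the parse call above:

# Sketch: the same parse done directly with pd.read_excel
df_alt = pd.read_excel('final.xlsx', sheet_name='A64. Sector Balance Sheets', skiprows=5)
df_alt.head()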

Import PDF data as a DataFrame

pip install tabula-py

import pandas as pd
from tabula import read_pdf

pdf_file = 'file.pdf'

# We use tabula to read the tables in the PDF file; it returns a list of DataFrames
df_list = read_pdf(pdf_file, pages='all')
# then combine all of them into a single DataFrame
df = pd.concat(df_list, ignore_index=True)
df
WARNING:tabula.backend:Got stderr: Feb 15, 2024 8:18:17 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
WARNING: Using fallback font LiberationSans for base font Symbol
Feb 15, 2024 8:18:17 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
WARNING: Using fallback font LiberationSans for base font ZapfDingbats
