import numpy as np
import pandas as pd
Search for data in .txt, .tsv, .csv, .pkl, Excel, SAS, Stata, .pdf, and JSON formats, and download it.
Import the data from each of the file formats listed in the first task. Explain in a comment why you chose numpy, pandas, or another library.
# For delimited text files such as .csv, .txt, and .tsv, we use pandas because read_csv loads them straight into a DataFrame:
hello 1 2 3
data = pd.read_csv('mnist_test.csv')
data.head()
label 1x1 1x2 1x3 1x4 1x5 1x6 1x7 1x8 1x9 ... 28x19 28x20 28x21 28x22 28x23 28x24 28x25 28x26 28x27 28x28
0 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 2 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
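As a small illustration of how the two libraries divide the work here: pandas handles the labelled table, and numpy handles the pixel grid. The sketch below assumes the column layout shown in the output above (a `label` column followed by 784 pixel columns named `1x1` to `28x28`). It uses a synthetic one-row frame in place of `mnist_test.csv`, and `row_to_image` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for mnist_test.csv: one row, same column layout as above.
pixel_cols = [f"{r}x{c}" for r in range(1, 29) for c in range(1, 29)]
sample = pd.DataFrame([[7] + [0] * 784], columns=["label"] + pixel_cols)


def row_to_image(frame, row):
    """Return (label, 28x28 uint8 array) for one row of an MNIST-style frame."""
    label = int(frame.loc[row, "label"])
    pixels = frame.loc[row, pixel_cols].to_numpy(dtype=np.uint8)
    # Columns are ordered 1x1, 1x2, ..., 28x28, so a row-major reshape
    # restores the original grid.
    return label, pixels.reshape(28, 28)


label, image = row_to_image(sample, 0)
print(label, image.shape)
```

With the real file, `sample` would be replaced by the `data` frame returned by `pd.read_csv`.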
output unit,sex,age,geo\time 2021 2020 2019 2018 2017 2016 2015 2014 2013 ... 1969 1968 1967 1966 1965 1964 1963 1962 1
0 YR,F,Y1,AL : 79.4 80.4 80.2 79.7 79.8 79.2 79.8 79.6 ... : : : : : : : :
2 YR,F,Y1,AT : 82.9 83.5 83.3 83.2 83.4 83.0 83.3 83.0 ... : : : : : : : :
4 YR,F,Y1,BE : 82.3 83.6 83.2 83.2 83.2 82.6 83.1 82.5 ... 74.4 74.3 74.7 74.4 74.3 74.6 73.9 74.0 7
5 rows × 63 columns
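The table above uses `:` for missing values and packs four keys into its first column. Below is a minimal sketch of how such a tab-separated file could be read, assuming `:` always means "not available". The two-row sample is made up to mirror the layout shown and is not taken from the downloaded file.

```python
import io
import pandas as pd

# Made-up sample mirroring the layout above (tab-separated, ':' = missing).
raw = (
    "unit,sex,age,geo\\time\t2021\t2020\t2019\n"
    "YR,F,Y1,AL\t:\t79.4\t80.4\n"
    "YR,F,Y1,AT\t:\t82.9\t83.5\n"
)

# sep="\t" selects tab as the delimiter; na_values turns ':' into NaN so the
# year columns are parsed as floats instead of strings.
tsv = pd.read_csv(io.StringIO(raw), sep="\t", na_values=[":"])

# Split the packed first column into separate key columns.
key_col = tsv.columns[0]
keys = tsv[key_col].str.split(",", expand=True)
keys.columns = ["unit", "sex", "age", "geo"]
tidy = pd.concat([keys, tsv.drop(columns=key_col)], axis=1)
print(tidy)
```

For a file on disk, the `io.StringIO(raw)` argument would be replaced by the file path.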
# Read the Excel file with pandas; read_excel loads a sheet directly into a DataFrame.
excel = pd.read_excel('Financial Sample.xlsx')
excel.head()
Collecting sas7bdat
Downloading sas7bdat-2.2.3.tar.gz (16 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: six>=1.8.0 in /usr/local/lib/python3.10/dist-packages (from sas7bdat) (1.16.0)
Building wheels for collected packages: sas7bdat
Building wheel for sas7bdat (setup.py) ... done
Created wheel for sas7bdat: filename=sas7bdat-2.2.3-py3-none-any.whl size=16291 sha256=688b7602e14330f22ceeba0f77d2bc32d1712c4ad41
Stored in directory: /root/.cache/pip/wheels/d2/ad/aa/badcd17bd07e0df1adfc85e738acc942787648fb7ed4044543
Successfully built sas7bdat
Installing collected packages: sas7bdat
Successfully installed sas7bdat-2.2.3
data = pd.read_stata('Youtube.dta')
data.head()
0  FlsCjmMhFmw  YouTube Rewind: The Shape of 2017 | #YouTubeRe...  YouTube Spotlight  2017-12-06T17:58:51.000Z  149376127.0  3093544.0
YouTube
Rewind: The
YouTube 2017-12-
1 Fl Cj MhF Sh f 2017 137843120 0 3014471 0
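`pd.read_stata` can also load only selected columns through its `columns` argument, which helps with wide files. Below is a small round-trip sketch, assuming a made-up two-column file written to a temporary directory rather than `Youtube.dta`; the column names `views` and `likes` are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Made-up numeric data; the column names are illustrative only.
frame = pd.DataFrame({"views": [149376127.0, 137843120.0],
                      "likes": [3093544.0, 3014471.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.dta")
    frame.to_stata(path, write_index=False)               # write a .dta file
    everything = pd.read_stata(path)                      # read all columns
    views_only = pd.read_stata(path, columns=["views"])   # read a subset

print(everything.shape, list(views_only.columns))
```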
import pickle
# For pickle files we can use pandas or numpy: data = pd.read_pickle(...) or data = np.load(..., allow_pickle=True)
0  1528888293852872704  #TikTok #russia #refugeecrisis #ukraine #famin...  Danaya_Pashneya  2022-05-23 23:59:09+00:00
1  1528888224713777154  Russian diplomat to U.N. Boris Bondarev resign...  MJoyce2625  2022-05-23 23:58:52+00:00
2  1528888224197660672  Ukrainian Presidential Office discloses how ma...  knittingknots  2022-05-23 23:58:52+00:00
4  1528888200696975360  Alexander Lukashenko reminds me of an abused s...  i__heart__this  2022-05-23 23:58:47+00:00
@akshayalladi China
2022 05 22
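The pickle route can be tried on a small round trip. The sketch below writes a made-up DataFrame to a temporary pickle and reads it back with `pd.read_pickle` and with the standard-library `pickle.load`; the data and file name are illustrative only. Unpickling can execute code stored in the file, so it is only appropriate for files from a trusted source.

```python
import os
import pickle
import tempfile
import pandas as pd

# Made-up data standing in for the downloaded .pkl file.
frame = pd.DataFrame({"username": ["user_a", "user_b"], "n_posts": [3, 5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.pkl")
    frame.to_pickle(path)

    # Route 1: pandas reads the pickle and returns the stored object.
    via_pandas = pd.read_pickle(path)

    # Route 2: the standard library does the same with an open file handle.
    with open(path, "rb") as fh:
        via_pickle = pickle.load(fh)

print(via_pandas.equals(frame), via_pickle.equals(frame))
```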
Collecting PyPDF2
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 232.6/232.6 kB 4.7 MB/s eta 0:00:00
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
file = 'final.xlsx'
data = pd.ExcelFile(file)
data.sheet_names
['Disclaimer',
"What's new in V3",
'Front page',
'A1. Headline series',
'TableofContent',
'A2. Pop of Eng & GB 1086-1870',
'A3. Eng. Agriculture 1270-1870',
'A4. Ind Production 1270-1870',
'A5. Service Sector 1270-1870',
'A6. English GDP(O) 1270-1700',
'A7. GB GDP(O) 1700-1870',
'Notes on GDP estimates',
'A8. UK Real GDP(A)',
'A9. Nominal GDP (A)',
'A10. GNP and National Saving',
'A11. GDP(E) components - values',
'A12. GDP(E) components - vols',
'A13. GDP(E) contributions',
'A14. Real GDP(O) components',
'A15. Factor incomes by SIC ',
'A16. Industry GVA shares by SIC',
'A17. GDP(I) components',
'A18. Population 1680+',
'A19. Migration flows',
'A20. Migration by citizenship',
'A21. GDP per capita 1086+',
'A22. Coin in circulation',
"A23. Bank of England B'Sheet",
'A24. Monetary aggregates',
'A25. Credit aggregates',
'A25a. Bills of Exchange',
'A26. Central govt 1290-1689',
'A27. Central govt borrowing ',
'A28. Public Sector Borrowing',
'A28a. Public Sector Spending',
'A29. The National Debt',
'A30a. Nat Debt mkt vals 1727',
'A30b. Nat Debt mkt vals 1900-',
'A30c. Term Annuities 1694-1836',
'A31. Interest rates & asset ps ',
'A32. Property prices & rent',
'A33. Exchange rate data',
'A34a. Trade volumes 1280-',
'A34b. Off. trade values 1697-',
'A35. Trade volumes and prices',
'A36. Trade values and BOP',
'A37. Trade breakdown 1663-1701',
'A38. Trade breakdown 1699-1774',
'A39. Trade - Goods trade 1774',
'A40. Trade by region 1710-1822',
'A41. Trade by region 1784+',
'A42. Regional Trade summary',
'A43. Trade - by Trade Area',
'A44. Trade by Country',
'A45. UK Shares of world trade',
'A46. Net external assets',
'A47. Wages and prices',
'A48. Real Earnings ',
' 9 G l t i C 8th'
Securities Securities
Total Currency Shares Insurance Other Total Currency
other other
0 NaN financial and Loans and other technical accounts financial and Loa
than than
assets deposits equities reserves receivable liabilities deposits
shares shares
... ... ... ... ... ... ... ... ... ... ... ... ...
176 2002 3369 1442 500 773 639 13 2 NaN 3330 1298 587 5
177 2003 3767 1608 572 837 737 10 2 NaN 3720 1475 612 5
178 2004 4288 1858 672 943 803 11 2 NaN 4175 1644 685 7
179 2005 5274 2202 832 1252 971 14 2 NaN 5130 2080 793 8
180 2006 5789 2374 956 1301 1141 15 2 NaN 5520 2213 909 9
import pandas as pd
from tabula import read_pdf
pdf_file = 'file.pdf'
# tabula's read_pdf extracts the tables from the PDF and returns them as a list of DataFrames
df_list = read_pdf(pdf_file, pages='all')
# Then combine them into a single DataFrame
df = pd.concat(df_list, ignore_index=True)
df
WARNING:tabula.backend:Got stderr: Feb 15, 2024 8:18:17 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
WARNING: Using fallback font LiberationSans for base font Symbol
Feb 15, 2024 8:18:17 AM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
WARNING: Using fallback font LiberationSans for base font ZapfDingbats
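`read_pdf` returns one DataFrame per detected table, and `pd.concat` lines columns up by name. Below is a minimal sketch of what that means when the tables do not share the same headers, using two made-up frames in place of tabula output.

```python
import pandas as pd

# Two made-up tables standing in for the list returned by read_pdf.
page1 = pd.DataFrame({"year": [2002, 2003], "total": [3369, 3767]})
page2 = pd.DataFrame({"year": [2004], "loans": [943]})
tables = [page1, page2]

combined = pd.concat(tables, ignore_index=True)
print(combined)

# Columns missing from a table are filled with NaN, so it can be worth
# checking that the headers match before combining.
same_headers = all(list(t.columns) == list(tables[0].columns) for t in tables)
print(same_headers)
```

If the headers differ, inspecting each element of `df_list` separately before concatenating is usually the safer path.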