
Lecture 1: Set up Jupyter, Import Data from Web and Select Cases


Installing packages

• conda install pandas-datareader


Reading data from web
• import pandas as pd
• import pandas_datareader.data as web
• df = web.DataReader('AAPL', data_source='yahoo', start='1/1/2010', end='3/21/2017')
• df.to_csv('AAPL.csv')
• df.tail()
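The to_csv step above saves the downloaded table so it can be reloaded later without another web request. Here is a minimal sketch of that round trip. It assumes a tiny made-up price table in place of the real download; the file name prices_example.csv and the Close values are invented for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up prices for illustration only; these are NOT real AAPL quotes.
prices = pd.DataFrame(
    {"Close": [10.0, 10.5, 10.25]},
    index=pd.to_datetime(["2010-01-04", "2010-01-05", "2010-01-06"]),
)
prices.index.name = "Date"

# Write into a temporary folder so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "prices_example.csv")
    prices.to_csv(csv_path)
    # index_col and parse_dates restore the date index when reloading.
    reloaded = pd.read_csv(csv_path, index_col="Date", parse_dates=True)

print(reloaded.tail(2))
```

Without index_col and parse_dates, the dates would come back as an ordinary text column rather than as the index.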
Reading data
• import pandas as pd
• df = pd.read_csv('d:/CSR_user_timeline_2013.csv')
• print(len(df))
• df.head(2)
List all the columns in the DataFrame
• df.columns
• len(df.columns)
• Data types
• df.dtypes
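If you do not have the tweet file at hand, you can try the same inspection steps on a small stand-in. This sketch assumes three invented rows read from a string instead of a file; the account names and values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the tweet file; all values are invented.
csv_text = """from_user_screen_name,retweet_count,created_at
AccountA,3,2013-01-01 10:00:00
AccountB,0,2013-01-02 11:30:00
AccountA,7,2013-01-03 09:15:00
"""

tweets = pd.read_csv(io.StringIO(csv_text))

print(len(tweets))            # number of rows
print(list(tweets.columns))   # column names
print(len(tweets.columns))    # number of columns
print(tweets.dtypes)          # one data type per column
```

Note that created_at is read as text here; read_csv parses dates only when asked to, for example through its parse_dates argument.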
Remove Unneeded Columns
• df = df.drop('created_at_text', axis=1)
• df = df.drop('tweet_id', axis=1)
• df = df.drop('withheld_in_countries', axis=1)
• df = df.drop('withheld_scope', axis=1)
• df = df.drop('truncated', axis=1)
• df = df.drop('possibly_sensitive', axis=1)
• len(df.columns)
• df.head(2)
• If you have only a few columns to delete, you can use
the drop command as shown above. If instead you want
to keep only a few columns, you can create a new
version of the dataframe containing just those
columns. Note that in pandas the double square brackets
-- "[[...]]" -- pass a list of column names and return
a dataframe. In the following example, I create
a new dataframe with only three variables. You can see
that this new dataframe has the same number of
tweets but fewer columns (variables).
Creating a new dataframe
• df2 = df[['created_at', 'from_user_screen_name', 'retweet_count']]
• print(len(df2))
• df2.head(2)
• len(df.columns)
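The two approaches above (dropping columns versus keeping a chosen few) can be compared side by side on a toy table. This is a sketch under assumptions: the rows are invented, and only a handful of the lecture's column names are reused.

```python
import pandas as pd

# Invented sample rows; the column names mirror some of those in the lecture.
tweets = pd.DataFrame({
    "created_at": ["2013-01-01", "2013-01-02"],
    "from_user_screen_name": ["AccountA", "AccountB"],
    "retweet_count": [3, 0],
    "truncated": [False, False],
    "possibly_sensitive": [None, None],
})

# Drop several columns in a single call by passing a list of names.
trimmed = tweets.drop(columns=["truncated", "possibly_sensitive"])

# Double brackets: a list of column names returns a DataFrame.
subset = tweets[["created_at", "retweet_count"]]

# Single brackets: one column name returns a Series.
single = tweets["retweet_count"]

print(type(subset).__name__, type(single).__name__)
print(len(trimmed.columns), len(subset.columns), len(subset) == len(tweets))
```

Neither drop nor the bracket selection changes the original table; each returns a new object, which is why the lecture assigns the result back to a variable.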
View Twitter Accounts Represented in DF
• We can use the unique function to find
how many unique Twitter accounts are
represented in the dataset. First, I'll show
you what unique function does -- it creates
an array of all the screen_names of the
Twitter accounts.
• pd.unique(df.from_user_screen_name.ravel())
• len(pd.unique(df.from_user_screen_name.ravel()))
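As a possible alternative to pd.unique combined with len, a column also offers the unique, nunique and value_counts methods directly. The sketch below assumes four invented screen names rather than the real dataset.

```python
import pandas as pd

# Invented screen names standing in for the from_user_screen_name column.
names = pd.Series(
    ["AccountA", "AccountB", "AccountA", "AccountC"],
    name="from_user_screen_name",
)

unique_names = names.unique()               # distinct values, in order of first appearance
n_accounts = names.nunique()                # how many distinct values there are
tweets_per_account = names.value_counts()   # number of rows per distinct value

print(list(unique_names))
print(n_accounts)
print(tweets_per_account)
```

value_counts is handy when you want to know not just which accounts appear but how many tweets each one contributed.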

• Remove Tweets from One Specific Account
• We want to get rid of all tweets by TICalculators from the
dataframe. Unlike the other 41 Twitter accounts in the
dataset, this account is not a CSR-related account. First,
we can use the len function combined with a dataframe
query to count the number of tweets that are not sent by
TICalculators.
• len(df[df['from_user_screen_name'] != 'TICalculators'])
• We should then also check how many tweets are sent by
TICalculators: 1,767
• len(df[df['from_user_screen_name'] == 'TICalculators'])

• We can use Python to do "math." Let's use this to show
whether the two numbers returned in the above steps
add up to the total number of tweets in our dataframe.
• 1767 + 32330
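• We can also check this another way; the result is 0 if the two counts add up to the total:
• (1767 + 32330) - len(df)
• Now remove the TICalculators tweets by keeping only the rows from the other accounts:
• df = df[df['from_user_screen_name'] != 'TICalculators']
• print(len(df))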
• df.head(2)
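The count-then-remove steps above all rest on a boolean mask: a column of True/False values, one per row. Here is a minimal sketch on invented data, where the account name OffTopic is a hypothetical stand-in for the account being removed.

```python
import pandas as pd

# Invented rows; "OffTopic" plays the role of the account to remove.
tweets = pd.DataFrame({
    "from_user_screen_name": ["AccountA", "OffTopic", "AccountB", "OffTopic", "AccountA"],
    "retweet_count": [3, 1, 0, 2, 7],
})

# One True/False value per row.
is_off_topic = tweets["from_user_screen_name"] == "OffTopic"

n_removed = is_off_topic.sum()     # True counts as 1, so sum() counts matching rows
n_kept = (~is_off_topic).sum()     # ~ flips True and False

# Sanity check: the two groups together make up the whole table.
assert n_removed + n_kept == len(tweets)

# Keep only the rows where the mask is False, then renumber the rows.
filtered = tweets[~is_off_topic].reset_index(drop=True)
print(len(filtered))
```

Storing the mask in a variable avoids typing the comparison twice and makes the sanity check easy to read.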
• Now let's check again for all the unique accounts in
the dataframe -- as you can see, TICalculators is
gone and there are now 41 accounts.
• pd.unique(df.from_user_screen_name.ravel())

• len(pd.unique(df.from_user_screen_name.ravel()))
• Plotting data as a figure
• df['retweet_count'].plot(legend=True, figsize=(12, 8), title='user_screen', label='twitter')
• Count function
• df.count()

• Max function
• df.max()

• Min function
• df.min()
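To see what these three functions return, it helps to try them on a table small enough to check by hand. This sketch assumes invented values; favorite_count is a hypothetical column added only to show how a missing value is treated.

```python
import pandas as pd

# Invented values; favorite_count is hypothetical and contains one missing entry.
tweets = pd.DataFrame({
    "retweet_count": [3, 0, 7, 2],
    "favorite_count": [1.0, None, 4.0, 2.0],
})

counts = tweets.count()   # non-missing values per column
largest = tweets.max()    # largest value per column
smallest = tweets.min()   # smallest value per column

print(counts)
print(largest)
print(smallest)

# The same functions also work on a single column.
max_retweets = tweets["retweet_count"].max()
```

Note that count reports 3 for favorite_count, not 4, because missing values are not counted.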
ASSIGNMENT 1

1. Print the maximum value of retweet_count.
2. Print the average value of retweet_count.
3. Create a figure from the created_at column.
4. Display the unique values in created_at.
5. Read the AAPL financial history from 2012 to 2018.
6. Create a dataframe containing only rowid, query and tweet_id_str.
