
Tweets were collected using Twitter's API and a custom web-crawler algorithm written in Python.

The tweets were filtered using Twitter's @mention feature ("@company") to collect tweets
directed at the target companies, such as "@Dior" and "@Gucci". The sample period was
1 January 2020 to 31 December 2020 (original dataset v1). This dataset contains tweet IDs,
sentiment, tweet content, posting time, posting location, etc.

Target company Twitter accounts: @gucci, @Dior, @CHANEL, @LouisVuitton, @Hermes_Paris
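As a rough sketch of how this collection step might look in Python, the following uses the Tweepy library against the Twitter API v2 full-archive search. This is an assumption: the original custom crawler is not shown, full-archive search requires Academic Research access, the bearer token is a placeholder, and the lang:en restriction is illustrative (matching the English stop-word removal described later).

```python
import tweepy

# Placeholder credentials; full-archive search assumes Academic Research access.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN", wait_on_rate_limit=True)

brands = ["gucci", "Dior", "CHANEL", "LouisVuitton", "Hermes_Paris"]

tweets = []
for brand in brands:
    # Match tweets that @-mention the brand within the 2020 sample period.
    query = f"@{brand} lang:en"
    for page in tweepy.Paginator(
        client.search_all_tweets,
        query=query,
        start_time="2020-01-01T00:00:00Z",
        end_time="2020-12-31T23:59:59Z",
        tweet_fields=["created_at", "geo"],
        max_results=100,
    ):
        tweets.extend(page.data or [])
```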

Known issues with the collected data:
1. The collected data will contain a large number of retweets, producing duplicates; only
unique tweets should be retained (a deduplication sketch follows this list).
2. Some tweets are created by Twitter bots; these bot tweets need to be filtered out.
3. Non-textual elements such as #hashtags and URLs should be stripped from tweets so that
only the text remains.
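A minimal deduplication sketch for issue 1, assuming the `tweets` list from the collection sketch above; the normalisation rule (stripping the "RT @user:" prefix and case-folding) is illustrative:

```python
import re

seen = set()
unique_tweets = []
for tweet in tweets:
    # Strip a leading "RT @user:" marker and normalise case/whitespace so
    # retweets and verbatim copies collapse onto the same key.
    key = re.sub(r"^RT @\w+:\s*", "", tweet.text).strip().lower()
    if key not in seen:
        seen.add(key)
        unique_tweets.append(tweet)
```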

Data collection
See above

Data filtering
In conjunction with the requirements overview above, the following applies.
A tweet is considered valid only if the user has @-mentioned the respective brand.
The number of tweets per brand can be capped at 10-100 (depending on volume) to keep the
sample at a usable quality (see the sketch below).
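A sketch of this filtering rule, reusing `unique_tweets` and `brands` from the sketches above; the cap of 100 per brand is an illustrative choice within the stated 10-100 range:

```python
from itertools import islice

def filter_for_brand(tweets, brand, cap=100):
    """Keep tweets that explicitly @-mention the brand, up to `cap` of them."""
    mentioning = (t for t in tweets if f"@{brand.lower()}" in t.text.lower())
    return list(islice(mentioning, cap))

# One capped list of mentioning tweets per target brand.
filtered = {brand: filter_for_brand(unique_tweets, brand) for brand in brands}
```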

Data pre-processing
In conjunction with the requirements overview above, the following applies.
Twitter suffers from bot posting, duplicate posting, and posting by spam accounts.
Invalid tweets (as described above) are removed using Python (and possibly other tools) to
ensure that the dataset is of high quality.
The computational load is reduced by removing stop words that carry no analytical value
(e.g. he, she, they, it, the); a cleanup sketch follows.
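A cleanup sketch for the hashtag/URL and stop-word steps. Using NLTK's English stop-word list is an assumption; the source only says Python (and possibly other tools) was used:

```python
import re
from nltk.corpus import stopwords  # one-time setup: nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def clean_text(text):
    """Strip URLs, @mentions and #hashtags, then drop English stop words."""
    text = re.sub(r"https?://\S+", "", text)       # remove URLs
    text = re.sub(r"[@#]\w+", "", text)            # remove mentions and hashtags
    tokens = re.findall(r"[a-z']+", text.lower())  # crude word tokeniser
    return " ".join(t for t in tokens if t not in STOP_WORDS)

# Example usage on one brand's filtered tweets:
# cleaned = [clean_text(t.text) for t in filtered["gucci"]]
```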
