
Data Science

Datasets:
https://drive.google.com/file/d/15tIMFSDdQyye_NLf2vT3R8i6yPYjrDQ2/view?usp=sharing

Problems:
1. Exploratory data analysis: explore the data and present any interesting findings it
supports.
2. Machine Learning: pick one topic that can be explored from the datasets within the
following contexts: (1) product recommendation, (2) revenue uplift, (3) credit scoring, (4)
clustering, (5) delivery performance.
Rules:
The purpose of this data understanding test is to see whether the candidate can make sense of a
"large" dataset with limited information provided, and process and analyze it to the best of
their ability within the time limit. These are the challenges we at Bubu face on a day-to-day
basis: most of the time, we receive only a large dataset in raw format, without sufficient
information.

The end product is less of a priority. This test is a tool for you to showcase your data processing
and analysis prowess. You are allowed (encouraged) to do whatever you want with the data;
whatever tool; whatever ML method; the only restriction is that the output that you deliver must
be within the context mentioned above.

Once again, the final product is less of a priority. We want you to show off your data processing
and analysis prowess.

Timeframe:
You are given at least 1 week to work on these problems. By the end of the week, you are
expected to submit your draft/milestone to us in whatever (readable) format. The final version
shall be presented during the interview.

Good Luck!!
Metadata:

Geolocation Dataset
This dataset includes random latitudes and longitudes from a given zip code prefix.

geolocation_olist_public_dataset.csv

● zip_code_prefix: The first three digits of a zip code.
● city: City associated with the zip code.
● state: State/province associated with zip code.
● lat: latitude of a random address with the given zip code prefix.
● lng: longitude of a random address with the given zip code prefix.
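Because each row is a random point inside a prefix, one possible first step is to collapse the points into a single representative coordinate per prefix. The snippet below is a minimal sketch of that idea using only the column names listed above; the sample rows and the `prefix_centroids` helper are hypothetical, and a plain average is just one of several reasonable choices.

```python
import csv
import io
from collections import defaultdict

# Toy rows in the documented column layout; the values are made up for illustration.
SAMPLE_GEO = """zip_code_prefix,city,state,lat,lng
010,sao paulo,sp,-23.50,-46.60
010,sao paulo,sp,-23.60,-46.70
200,rio de janeiro,rj,-22.90,-43.20
"""


def prefix_centroids(rows):
    """Average the sampled coordinates for each zip code prefix."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # [lat total, lng total, row count]
    for row in rows:
        acc = sums[row["zip_code_prefix"]]
        acc[0] += float(row["lat"])
        acc[1] += float(row["lng"])
        acc[2] += 1
    return {prefix: (lat / n, lng / n) for prefix, (lat, lng, n) in sums.items()}


# Read the prefix as text so that leading zeros are preserved.
centroids = prefix_centroids(csv.DictReader(io.StringIO(SAMPLE_GEO)))
```

To run this on the real file, replace the in-memory sample with an open handle on geolocation_olist_public_dataset.csv.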
Unclassified Orders Dataset
This dataset includes 100k rows and 21 features.

● Note that a comment may be repeated if an order has two or more different products.
● An order may also be fulfilled by more than one seller if the customer purchases more
than one product.
● Some review comments had personal data like phone numbers, so we did a regex
search replacing every group of 3 digits with '000'. This might also alter data other
than phone numbers in the comments.
● All text identifying stores and partners was replaced by the names of Game of Thrones
great houses.
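The exact regex used for the anonymization is not given. The sketch below assumes the simplest reading, a replacement of every non-overlapping run of three digits, to show why non-phone numbers in the comments can be affected too. The `mask_digit_groups` helper and the sample strings are hypothetical.

```python
import re


def mask_digit_groups(text):
    # Assumed pattern: each non-overlapping run of three digits, scanned left to right.
    return re.sub(r"\d{3}", "000", text)


masked_phone = mask_digit_groups("ligue 11987654321")  # a phone-like number
masked_price = mask_digit_groups("paguei 1500 reais")  # a price, not personal data
```

Under this assumption a price such as 1500 is rewritten as well, so numeric values found in review text should be treated with caution.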

olist_public_dataset_v2.csv

● order_id: unique identifier of the order.
● order_status: Reference to the order status (delivered, shipped, etc).
● order_products_value: Total products price of an order.
● order_freight_value: Total freight value of an order.
● order_items_qty: Total quantity of items purchased in an order.
● order_sellers_qty: Total quantity of sellers that fulfilled an order.
● order_purchase_timestamp: Shows the purchase timestamp.
● order_aproved_at: Shows the payment approval timestamp.
● order_estimated_delivery_date: Shows the estimated delivery date that was informed to
customer at the purchase moment.
● order_delivered_customer_date: Shows the actual order delivery date to the customer.
● customer_id: identifier of the customer. Each order always has its own unique customer_id. To
find the unique customer identifier (customer_unique_id), see the customers dataset.
● customer_city: Customer city
● customer_state: Customer state/province
● customer_zip_code_prefix: The first three digits of customer zip code.
● product_category_name: The root category of the purchased product, in Portuguese.
● product_name_lenght: Number of characters extracted from the purchased product
name.
● product_description_lenght: Number of characters extracted from the purchased
product description.
● product_photos_qty: Number of published photos of the purchased product.
● product_id: unique identifier of product.
● review_id: unique identifier of review
● review_score: Score ranging from 1 to 5 given by the customer on a satisfaction survey.
● review_comment_title: Comment title from the review left by the customer, in
Portuguese.
● review_comment_message: Comment message from the review left by the customer, in
Portuguese.
● review_creation_date: Shows the date in which the satisfaction survey was sent to the
customer.
● review_answer_timestamp: Shows satisfaction survey answer timestamp.
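For the delivery performance topic, the two delivery columns above can be compared directly. The following is a minimal sketch; the timestamp layout in `TS_FORMAT` is an assumption that should be checked against the actual file, and `delivery_delay_days` is a hypothetical helper name.

```python
from datetime import datetime

# Assumed timestamp layout; verify against the actual file before relying on it.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def delivery_delay_days(estimated, delivered):
    """Return days late (positive) or early (negative); None if there is no delivery date."""
    if not delivered:
        return None
    est = datetime.strptime(estimated, TS_FORMAT)
    act = datetime.strptime(delivered, TS_FORMAT)
    return (act - est).total_seconds() / 86400  # 86400 seconds per day


# order_estimated_delivery_date vs. order_delivered_customer_date (made-up values)
late = delivery_delay_days("2017-05-10 00:00:00", "2017-05-12 12:00:00")
early = delivery_delay_days("2017-05-10 00:00:00", "2017-05-08 00:00:00")
pending = delivery_delay_days("2017-05-10 00:00:00", "")
```

Orders that are not yet delivered (for example, with status shipped) may have an empty delivery date, which is why the helper returns None in that case instead of raising.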
Customers Dataset
This dataset includes unique identifiers of customers.
olist_public_dataset_v2_customers.csv

● customer_id: key to the orders dataset. Each order has a unique customer_id.
● customer_unique_id: unique identifier of a customer.
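Since every order carries its own customer_id, repeat buyers only become visible after mapping customer_id to customer_unique_id. A minimal sketch of that lookup follows; the rows are toy data with made-up identifiers, and only the column names come from the descriptions above.

```python
from collections import Counter

# Toy data; identifiers are made up for illustration.
orders = [
    {"order_id": "o1", "customer_id": "c1"},
    {"order_id": "o2", "customer_id": "c2"},
    {"order_id": "o3", "customer_id": "c3"},
]
customers = [
    {"customer_id": "c1", "customer_unique_id": "u1"},
    {"customer_id": "c2", "customer_unique_id": "u1"},  # same person, second order
    {"customer_id": "c3", "customer_unique_id": "u2"},
]

# Map each per-order customer_id to the stable customer_unique_id.
to_unique = {c["customer_id"]: c["customer_unique_id"] for c in customers}

# Count orders per real customer.
orders_per_customer = Counter(to_unique[o["customer_id"]] for o in orders)
```

Counting on customer_id alone would report one order per customer here, which hides the repeat purchase by u1.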

Payment Dataset
This dataset includes data about the payment options from orders.
olist_public_dataset_payments.csv

● order_id: unique identifier of an order.
● installments: quantity of installments chosen by the customer.
● sequential: a customer may pay for an order with more than one payment method. If they
do so, a sequence will be created to accommodate all payments.
● payment_type: method of payment chosen by the customer.
● value: transaction value.
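Because one order can span several payment rows, the amount paid per order has to be aggregated across the sequence. The sketch below shows one way to do that; the rows and the payment_type values are made up for illustration and may not match the labels used in the real file.

```python
from collections import defaultdict

# Toy rows in the documented column layout; all values are made up for illustration.
payments = [
    {"order_id": "o1", "sequential": 1, "payment_type": "credit_card",
     "installments": 3, "value": 80.0},
    {"order_id": "o1", "sequential": 2, "payment_type": "voucher",
     "installments": 1, "value": 20.0},
    {"order_id": "o2", "sequential": 1, "payment_type": "credit_card",
     "installments": 1, "value": 45.5},
]

total_paid = defaultdict(float)      # order_id -> sum of transaction values
methods_used = defaultdict(set)      # order_id -> distinct payment types

for p in payments:
    total_paid[p["order_id"]] += p["value"]
    methods_used[p["order_id"]].add(p["payment_type"])
```

Joining the payment rows to the orders dataset without aggregating first would duplicate any order that has more than one payment row.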

Category Name Translation
Translates the product_category_name to English.
