
Introduction

In this competition, we must forecast microbusiness development across numerous counties and
states based on the density of microbusinesses in each county. Microbusinesses are frequently
too tiny or too new to appear in typical economic data sources, although their activity may be
tied to other economic variables of broader relevance. Policymakers strive to create more
inclusive and recession-resistant economies. We are also conscious that, thanks to technological
advancements, entrepreneurship has never been easier to pursue than it is today. Whether it is to
achieve a better work-life balance, explore a passion, or replace lost employment, statistics show
that Americans are increasingly inclined to start a company of their own to suit their needs
and financial objectives. The problem is that these "microenterprises" are frequently too small or
too recently established to appear in regular economic data sources, making it difficult for
policymakers to study them.

Significance of project

Our analysis assists policymakers in understanding more about microenterprises, a growing
trend of extremely small companies. This additional data will aid in the development of new
programs and policies to improve the performance and impact of microenterprises.

Dataset Overview

The business activity index, commerce and industry dataset, census dataset, and the unique
microbusiness density dataset are the main components of our data. This is a forecasting
competition in which historical economic data is freely available. Data gathered after the
submission deadline expires is used to score the forecasting-phase public leaderboard and the
final private leaderboard. We produce static predictions that can only incorporate information
available before the submission period ends. This also implies that, although submissions are
rescored during the forecasting period, notebooks are not rerun.

train.csv
test.csv

Data cleaning
EDA

Finding the total number of counties, states, and census records:

Finding the first day of each month


Hence, the data contains 1,871 unique county names and 51 states; there are 3,135 unique cfips
codes, while the census data covers 3,142 counties in total.
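As a rough illustration, the counts above can be reproduced with pandas; this is a minimal sketch that assumes train.csv carries the cfips, county, state, and first_day_of_month columns.

import pandas as pd

# Load the training file and parse the date column (column names assumed).
train = pd.read_csv("train.csv", parse_dates=["first_day_of_month"])

print("unique county names:", train["county"].nunique())
print("unique states:", train["state"].nunique())
print("unique cfips codes:", train["cfips"].nunique())

# Normalize each timestamp to the first day of its month.
train["first_day_of_month"] = train["first_day_of_month"].dt.to_period("M").dt.to_timestamp()
print(train["first_day_of_month"].min(), "->", train["first_day_of_month"].max())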

Finding outliers:

Change point and outlier detection are key techniques for time series analysis, since they can aid
in identifying major changes or irregularities in the data. Time series data are frequently
non-stationary, which means that their statistical properties fluctuate over time. These
changes can be caused by a variety of sources, including alterations in underlying patterns,
modifications to the data-generating process, or the appearance of unusual events or aberrations.
Change point detection can help determine when these changes occur and provide information
about their underlying causes.
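As a simple, hedged illustration of this idea, the sketch below flags months whose month-over-month change is unusually large for a given series; it is a rough proxy for change point and outlier detection, not the exact method used in our pipeline.

import numpy as np
import pandas as pd

def flag_level_shifts(s: pd.Series, n_sigmas: float = 4.0) -> pd.Series:
    """Flag points whose month-over-month change is unusually large
    relative to the series' typical change (a robust-scale rule)."""
    diff = s.diff()
    scale = 1.4826 * diff.abs().median()   # robust estimate of a typical change
    if not np.isfinite(scale) or scale == 0:
        return pd.Series(False, index=s.index)
    return diff.abs() > n_sigmas * scale

# Toy example: the sudden level shift at the fifth value is flagged.
s = pd.Series([3.0, 3.1, 3.0, 3.1, 5.9, 6.0, 6.1, 6.0])
print(flag_level_shifts(s))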
We can observe that the data is distributed unevenly throughout the columns:

• The columns "microbusiness_density" and "active" have most of their values concentrated at
the low end, with a long tail produced by certain outliers.

• Access to online resources is comparatively high and stable across regions, with around 80%
of individuals having broadband access.

• The proportion of the population with a college degree is rather low, at approximately twelve
percent, and is likewise concentrated at the low end of its range.

• The proportion of the population born abroad is quite low, implying that few individuals
migrate to these places, resulting in low population turnover.

• The IT sector employs a sizable proportion of the workforce, creating opportunities for
these regions to produce technology-related services and help people better understand
technology.
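A quick way to check these distributional claims is to look at the skewness and extreme percentiles of the relevant columns; the sketch below assumes a hypothetical merged file and census column names, which may differ from the actual labels.

import pandas as pd

df = pd.read_csv("train_with_census.csv")   # hypothetical merged train + census file
cols = ["microbusiness_density", "active", "pct_bb_2021",
        "pct_college_2021", "pct_foreign_born_2021", "pct_it_workers_2021"]

# Positive skew means the mass sits at the low end with a long upper tail.
print(df[cols].skew().sort_values(ascending=False))
print(df[cols].describe(percentiles=[0.01, 0.5, 0.99]).T)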

Building of Models

KNN model

For the boosting methods, we constructed the following features: lag features, basic features,
encoding features, and rank features. We then built additional features based on each county's
neighbors. The k-NN features are based on several attributes, such as census data, microbusiness
densities, and their corresponding changes. The k-NN features improved both our training loss
and validation loss: the validation loss for LightGBM decreased by around 3% to approximately
2.3, showing that capturing neighbors as features is highly effective for prediction.
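The neighbor features can be built in several ways; the sketch below is one hedged possibility, averaging the target over each county's k nearest neighbors in census-feature space (the column names and k are illustrative, not our exact settings).

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def add_knn_features(df: pd.DataFrame, feat_cols, target_col="microbusiness_density", k=5):
    """Attach the mean target of each county's k nearest neighbors
    (measured in the space spanned by feat_cols) as a new feature."""
    X = df[feat_cols].to_numpy()
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because a point is its own nearest neighbor
    _, idx = nn.kneighbors(X)
    neighbor_idx = idx[:, 1:]                         # drop the point itself
    target = df[target_col].to_numpy()
    out = df.copy()
    out[f"{target_col}_knn_mean"] = target[neighbor_idx].mean(axis=1)
    return out

# Usage sketch (per-county frame with census features already merged):
# counties = add_knn_features(counties, ["pct_bb_2021", "pct_college_2021"], k=5)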

SMAPE

We used SMAPE (Symmetric Mean Absolute Percentage Error) as our time series evaluation
metric. It is commonly used for time series tasks and penalizes underpredictions more than
overpredictions. There are other commonly used percentage-based evaluation metrics, such as
the Mean Absolute Percentage Error (MAPE). However, MAPE has its limitations:

1. It cannot be used if there are zero or close-to-zero values, as division by zero or small values
will tend to infinity.
2. Forecasts that are too low will have a percentage error that cannot exceed 100%, but for
forecasts that are too high, there is no upper limit to the percentage error. This means that the
evaluation metric will systematically select a method whose forecasts are low.

In contrast to MAPE, SMAPE has both a lower and an upper bound. The log-transformed
accuracy ratio sometimes used in place of MAPE actually has a shape similar to SMAPE. In our
task, when considering Type 1 and Type 2 errors, we would rather penalize Type 2 errors, i.e.,
predictions of microbusiness density that fall far below the true values. From a resource
allocation perspective, underpredicting microbusiness densities can be harmful to those doing
business, since they won't receive the resources they need. SMAPE penalizes underpredictions
more than log-transformed MAPE, which is why we chose it.
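For reference, here is a small SMAPE implementation on the usual 0-200 scale; it is a sketch that also illustrates the asymmetry discussed above (the zero-denominator convention is our assumption).

import numpy as np

def smape(y_true, y_pred) -> float:
    """Symmetric mean absolute percentage error on a 0-200 scale.
    Pairs where truth and prediction are both zero contribute 0."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    safe = np.where(denom == 0, 1.0, denom)
    ratio = np.where(denom == 0, 0.0, np.abs(y_pred - y_true) / safe)
    return 100.0 * ratio.mean()

# Under-forecasting by 20 hurts more than over-forecasting by 20:
print(smape([100], [80]))    # ~22.2
print(smape([100], [120]))   # ~18.2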

Previously, we identified some key features contributing to our training, for example the
percentage of households with broadband access and the percentage of the population aged
25+ with a college degree. These features suggest that some economic indicators may be
correlated with microbusiness densities, so we added more external datasets to our model. With
these features added, the loss improved by another 3%. Improving the model solely through
feature engineering is challenging, and we had already transformed our target to the log
difference (a minimal sketch of this transform follows), so where could further improvement
come from?
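Here is a minimal sketch of the log-difference transform and its inverse, assuming the frame is sorted by cfips and month (names illustrative):

import numpy as np
import pandas as pd

def add_log_diff_target(df: pd.DataFrame, col: str = "microbusiness_density") -> pd.DataFrame:
    """Add the month-over-month log1p difference of the target per county."""
    out = df.sort_values(["cfips", "first_day_of_month"]).copy()
    out["target"] = out.groupby("cfips")[col].transform(lambda s: np.log1p(s).diff())
    return out

def invert_log_diff(last_density: float, predicted_diff: float) -> float:
    """Turn a predicted log1p difference back into a density forecast."""
    return float(np.expm1(np.log1p(last_density) + predicted_diff))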

The answer lies in outliers. For time series tasks, models are very sensitive to outliers. There are
many outliers in our target across timestamps, especially for counties with lower populations.
Some of the smallest counties have fewer than 1,000 people, and if one person suddenly decides
to start a microbusiness, the density can change drastically.

Anomaly detection significantly improved our model, with the SMAPE loss improving by
around 30%. This demonstrates the importance of smoothing in some time series tasks.
Smoothing will be applied to data for all models moving forward, which will also be shown in
other models.
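One simple way to realize that smoothing is sketched below, under the assumption of a rolling-median replacement rule; it is not necessarily the exact scheme in our pipeline.

import numpy as np
import pandas as pd

def smooth_series(s: pd.Series, window: int = 5, n_sigmas: float = 3.0) -> pd.Series:
    """Replace points far from a rolling median with the median itself;
    all other points are left untouched."""
    med = s.rolling(window, center=True, min_periods=1).median()
    resid = (s - med).abs()
    sigma = 1.4826 * resid.rolling(window, center=True, min_periods=1).median()
    is_outlier = (resid > n_sigmas * sigma.replace(0, np.nan)).fillna(False)
    return s.mask(is_outlier, med)

# Applied per county before training, e.g.:
# train["microbusiness_density"] = (train.groupby("cfips")["microbusiness_density"]
#                                        .transform(smooth_series))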
LightGBM & XGBoost

After applying some tricks, here are the feature importances from LightGBM and XGBoost. With
only the top 30 features, the model performs as well as it does with the full feature set. First, we
aim to find a lower-dimensional representation of the distances between targets. A few properties
of SMAPE are also worth keeping in mind:
• SMAPE is asymmetric with respect to over- and under-prediction.
• Over-forecasting by the same proportion results in a smaller loss, so over-forecasting is
preferable when the error is substantial.
• SMAPE has a maximum value of 200, whereas MAPE has no upper bound for over-forecasts.

The largest 1.95% of SMAPE losses account for roughly fifty percent of the total loss. In fact,
this is not an unusually strong skew toward the largest losses; typically, around 20% of the
samples account for roughly 80% of the total loss.
The plot above illustrates the main features; such plots are usually attractive and give the
appearance of sophisticated analytics and data knowledge, so it is hard not to include one.

We can observe that the data is highly noisy and that elaborate baselines do not significantly
improve on the Last Value baseline. As a result, fairly simple logic can produce good results.

The key idea of the subsequent baseline is to build several groups of cfips and simply predict by
multiplying the last value by a certain multiplier. Losses over the chosen training window
determine the best multiplier for each group of cfips; a minimal sketch of this multiplier search
follows the list below.
• For this kind of competition, it can be challenging to beat a simple baseline when predicting
target values directly. Additionally, there is a considerable probability that a trivial solution
will land in one of the top spots on the private leaderboard by accident.
• However, overfitting to the public leaderboard provides some benefit, since the public test data
will not be released and the public score is a way to learn something about the January data
that is already available.
• Many public baselines simply forecast the next month and copy the same value to the following
months; be cautious if you wish to combine public submissions with your own intricate
trend-seasonality logic.
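A minimal sketch of that multiplier search follows; the grid of candidate multipliers is an assumption, not our exact setting.

import numpy as np

def smape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    safe = np.where(denom == 0, 1.0, denom)
    return 100.0 * np.mean(np.where(denom == 0, 0.0, np.abs(y_pred - y_true) / safe))

def best_multiplier(last_values, next_values, grid=np.linspace(0.99, 1.02, 31)):
    """Pick the multiplier m minimizing SMAPE when next_values is predicted
    as m * last_values -- the heart of the last-value baseline."""
    last_values, next_values = np.asarray(last_values, float), np.asarray(next_values, float)
    losses = [smape(next_values, m * last_values) for m in grid]
    return float(grid[int(np.argmin(losses))])

# Usage sketch: fit one multiplier per group of cfips on the last training months,
# then forecast every future month as multiplier * last observed density.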

DBSCAN clustering

The DBSCAN algorithm groups points according to a distance measure (usually the Euclidean
distance) and a minimum number of points. A key feature of the algorithm is that it treats points
in low-density regions as outliers; thus, it is not as sensitive to anomalies as K-Means clustering.
After forming the initial cluster, we examine all of its member points to find their Eps-neighbors.
When a point has at least MinPoints Eps-neighbors, the cluster is expanded to include those
Eps-neighbors. This procedure is repeated until there are no more points to add to the cluster.

The result above indicates that there are no missing values in our data. Let us take the Annual
Income and Spending Score columns from the prepared data and apply the DBSCAN algorithm
to them.

• If the dataset has two dimensions, set the minimum number of samples per cluster to four.
• If the data has more dimensions, the minimum number of samples per cluster should be:
Min_sample (MinPoints) = 2 * data dimensionality.

Because our data is two-dimensional, we leave the MinPoints argument at its default setting
of 4.

Calculating point distances using KNN and sorting the values:

To compute Eps, we use the KNN function to find the distance between every data point and its
k nearest neighbors. We then sort these distances and plot them. The point of maximum
curvature in the plot gives our Eps value.
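The Eps-selection and clustering steps above can be sketched as follows; X stands in for the two scaled columns we cluster, and the final Eps value has to be read off the elbow of the k-distance curve by eye.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                 # stand-in for the real two-column data

min_points = 4                                # 2 * data dimension
nn = NearestNeighbors(n_neighbors=min_points).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])            # distance of each point to its 4th neighbor

plt.plot(k_dist)                              # Eps is read off the curve's elbow
plt.ylabel("distance to 4th nearest neighbor")
plt.show()

eps = 0.5                                     # example value taken from the elbow
labels = DBSCAN(eps=eps, min_samples=min_points).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "| outliers:", int((labels == -1).sum()))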
Clusters plot
This is UMAP (Uniform Manifold Approximation and Projection), which relies on Riemannian
geometry of the manifold and algebraic topology. It uses fuzzy simplicial sets to approximate the
underlying structure of the data. This method captures both local and global structures by
considering the distances between data points in the high-dimensional space and building a
topological representation of the data. The loss function for UMAP is the cross-entropy between
the pairwise similarities in the high-dimensional space (P) and the low-dimensional space (Q).
To model the local and global structure of the data, high-dimensional pairwise similarities are
based on the distance metric, and low-dimensional pairwise similarities are based on the negative
exponential of the distance in the embedding space. We chose to use UMAP mainly because it is
better at preserving the global structure of the data, while t-SNE focuses more on local structure.
UMAP is more consistent due to its deterministic initialization compared to the random
initialization in t-SNE, and it scales better. So, our second approach involves finding neighbors
based on clusters using UMAP. The clusters seem to be reasonable, and in fact, there is a
significant overlap of neighbors, similar to k-NN, from the previous approach. After finding the
neighbors, we can use these neighbors to construct graphs for graph neural networks. For each
county, the counties' longitude and latitude are appended for distance calculation. Considering
each county as the source, the destinations are the neighbors, and the weights are the normalized
distances based on their geographical data. Since we are predicting MD on a monthly level, a
graph is generated for each month in the training data. Our model architecture consists of several
stacked layers; on the right is our training process, where early stopping is employed, and the
validation loss is very low, resulting in a SMAPE of around 0.9. The model performs well with
neighbors because of graph convolution.
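A hedged sketch of this neighbor-and-graph construction is shown below; the feature matrix, coordinates, cluster count, and number of neighbors are placeholders rather than our exact settings, and the umap-learn package is assumed to be installed.

import numpy as np
import umap                                    # umap-learn package
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

n_counties = 3135
county_features = np.random.rand(n_counties, 20)   # stand-in for census/density features
lat_lon = np.random.rand(n_counties, 2)            # stand-in for county latitude/longitude

# Embed the counties with UMAP, then group them into clusters.
embedding = umap.UMAP(n_components=2, n_neighbors=15, random_state=42).fit_transform(county_features)
clusters = KMeans(n_clusters=20, n_init=10, random_state=42).fit_predict(embedding)
print("cluster sizes:", np.bincount(clusters))

# Edges: each county (source) connects to its nearest neighbors in the embedding;
# weights are normalized geographic distances between source and destination.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(embedding)
_, idx = nn.kneighbors(embedding)
src = np.repeat(np.arange(n_counties), k)
dst = idx[:, 1:].reshape(-1)
geo_dist = np.linalg.norm(lat_lon[src] - lat_lon[dst], axis=1)
weights = geo_dist / geo_dist.max()
edge_index = np.stack([src, dst])              # edge list ready for a GNN library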

GCN

Graph convolution is an effective technique for processing graph-structured data. In a graph
convolutional network (GCN), the node features are updated based on the features of their
neighbors, allowing the model to learn a rich and expressive representation of the nodes in the
graph. This is especially useful when dealing with spatial data, as it captures the local
dependencies and relationships between the neighboring nodes.
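To make the graph-convolution idea concrete, here is a small plain-PyTorch sketch of one GCN-style layer (not our exact architecture): neighbor features are averaged through a normalized adjacency matrix and then passed through a shared linear transform.

import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: average the neighbor features,
    then apply a shared linear transform and a nonlinearity."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj: dense (N, N) adjacency with self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.linear((adj / deg) @ x))

# Toy usage: 4 nodes on a ring, 3 input features per node.
adj = torch.eye(4) + torch.tensor([[0, 1, 0, 1],
                                   [1, 0, 1, 0],
                                   [0, 1, 0, 1],
                                   [1, 0, 1, 0]], dtype=torch.float)
x = torch.randn(4, 3)
layer = SimpleGCNLayer(3, 8)
print(layer(x, adj).shape)    # torch.Size([4, 8])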
In our case, the model performs well because the graph convolution layers can efficiently capture
and learn the spatial relationships between neighboring counties. This leads to high model
performance; however, we trained the model with a very small batch size, which constrained its
performance. GCN generalizes the idea of convolutions from grid data to graph data. In contrast,
1D CNNs are particularly effective for time series
analysis. They employ convolutional layers with filters that slide along the input data, capturing
local patterns in the sequence. These filters are able to automatically detect and learn features,
extracting meaningful information from the data without requiring manual feature engineering.

Model architecture:

A GRU layer is used after the 1D convolution; by adding it, we can model the long-range
dependencies and temporal relationships in the data. The combination of 1D CNN and GRU
layers takes advantage of the strengths of both architectures. The 1D CNN captures local patterns
and features in the data, while the GRU layer models the long-range temporal relationships. This
combination helps in obtaining a richer representation of the time series data, leading to better
predictions. The 1D CNN + GRU model provides similar performance to the GNN, and it is
important to note that we are only using microbusiness densities for prediction. This model
architecture has potential for a wide range of time series tasks, not just for MD prediction.
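A minimal PyTorch sketch of such a 1D CNN + GRU model follows; the layer sizes, sequence length, and single-feature input are illustrative assumptions, not our exact configuration.

import torch
import torch.nn as nn

class CNNGRU(nn.Module):
    """1D convolutions capture local patterns; a GRU then models
    longer-range temporal structure; a linear head emits the forecast."""
    def __init__(self, n_features: int = 1, hidden: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.gru = nn.GRU(input_size=32, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)   # next-month density (or its log difference)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_features)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)   # (batch, seq_len, 32)
        out, _ = self.gru(h)
        return self.head(out[:, -1])                       # last time step -> forecast

model = CNNGRU()
print(model(torch.randn(8, 24, 1)).shape)   # torch.Size([8, 1])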

A new class was also verified to corroborate the general target-classification efficacy of the RCS.
As seen below, the 1D CNN-GRU achieved the best classification accuracy, with 99% validation
accuracy and 97.5% training accuracy.
Visualizing cross-validation behavior

Determining the appropriate cross-validation object is a critical step in fitting a model correctly.
There are numerous methods for splitting data into training and test sets, with the goals of
preventing model overfitting, keeping the number of samples per class in the test sets balanced,
and so on.
As you can observe, the KFold cross-validation iterator does not take the data points' class or
group into account by default. We can change that by using one of the following:

• StratifiedKFold to preserve the sample percentages for every class.
• GroupKFold to prevent the same group from appearing in two distinct folds.
• StratifiedGroupKFold to keep the GroupKFold constraint while attempting to return stratified
folds.

We'll write a function that lets us visualize how each cross-validation object behaves. We'll split
the data into four folds; for each split, we'll show the indices selected for the training (blue) and
the test (red) sets.
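The sketch below follows that plan with synthetic data (classes, groups, and fold count are illustrative), drawing one row per fold with training indices in blue and test indices in red.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, StratifiedKFold, GroupKFold, StratifiedGroupKFold

# Synthetic data: 100 points, imbalanced classes, 10 groups of 10 points.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = (rng.random(100) < 0.3).astype(int)
groups = np.repeat(np.arange(10), 10)

def plot_cv(cv, ax, title):
    """One row per fold: training indices in blue, test indices in red."""
    for fold, (train_idx, test_idx) in enumerate(cv.split(X, y, groups)):
        ax.scatter(train_idx, [fold] * len(train_idx), c="tab:blue", marker="_", lw=8)
        ax.scatter(test_idx, [fold] * len(test_idx), c="tab:red", marker="_", lw=8)
    ax.set_title(title)
    ax.set_ylabel("fold")
    ax.set_yticks(range(cv.get_n_splits()))

fig, axes = plt.subplots(4, 1, figsize=(10, 8), sharex=True)
for ax, cv in zip(axes, [KFold(4), StratifiedKFold(4), GroupKFold(4), StratifiedGroupKFold(4)]):
    plot_cv(cv, ax, type(cv).__name__)
plt.tight_layout()
plt.show()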
Conclusions

Building multiple models in this research, we found that microbusiness_density is fairly evenly
distributed across areas, while some locations, such as the states of Wyoming, Nevada, and South
Dakota, have a relatively high number of tiny firms. These constitute outliers, and their presence
tends to skew the mean and median. After these outliers are removed, the geographical
distribution of the total number of firms becomes more even, although there are still certain
locations with only a handful of small enterprises, among them the states of West Virginia and
Maine.
