
Top 20 Latest Research Problems in Big Data and Data Science

Problem statements in 5 categories, research methodology and research labs to follow

Even though big data has entered the mainstream of operations as of 2020, there are still open issues and challenges that researchers can address. Some of these issues overlap with the data science field. In this article, the top 20 interesting recent research problems at the intersection of big data and data science are covered, based on my personal experience (with due respect to the intellectual property of my organizations) and the latest trends in these domains [1,2].
These problems are grouped into five categories, namely:

• Core big data area to handle the scale
• Handling noise and uncertainty in the data
• Security and privacy aspects
• Data engineering
• Intersection of big data and data science

The article also covers a research methodology for solving the specified problems, and lists top research labs working in these areas that are worth following.

I encourage researchers to take up applied research problems, which will have greater impact on society at large. The reason to stress this point is that we hardly analyze 1% of the available data, while we generate terabytes of data every day. These problems are not specific to a single domain and can be applied across domains.
Let me first introduce the 8 V's of big data (based on an interesting article from Elena), namely Volume, Value, Veracity, Visualization, Variety, Velocity, Viscosity, and Virality. If we look closely at the questions on the individual V's in Fig 1, they trigger interesting points for researchers. Even though they are business questions, there are underlying research problems. For instance, 02-Value: "Can you find it when you most need it?" calls for analyzing the available data and giving context-sensitive answers when needed.

Fig 1: 8 V's of Big data (Courtesy: Elena)

Having understood the 8 V's of big data, let us look into the details of the research problems to be addressed. General big data research topics [3] are along the lines of:

• Scalability — scalable architectures for parallel data processing
• Real-time big data analytics — stream processing of text, image, and video data
• Cloud computing platforms for big data adoption and analytics — reducing the cost of complex analytics in the cloud
• Security and privacy issues
• Efficient storage and transfer
• How to efficiently model uncertainty
• Graph databases
• Quantum computing for big data analytics

Next, let me cover specific research problems across the five categories listed above.

The problems related to the core big data area of handling scale:

1. Scalable architectures for parallel data processing:

Environments such as Hadoop and Spark are used for offline or online processing of data. The industry is looking for scalable architectures to carry out parallel processing of big data. There has been a lot of progress in recent years; however, there is still huge potential to improve performance.
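
To make the baseline concrete, here is a minimal PySpark sketch (assuming pyspark is installed and a local Spark runtime is available); the input file and column names are hypothetical placeholders:

```python
# A minimal PySpark sketch; "events.csv" and "event_type" are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parallel-aggregation").getOrCreate()

# Read a (hypothetical) large CSV; Spark splits the work across partitions.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# A simple distributed aggregation, executed in parallel across the cluster.
counts = df.groupBy("event_type").agg(F.count("*").alias("n"))
counts.show()

spark.stop()
```

Research in this area typically asks how such jobs can be partitioned, scheduled, and shuffled more efficiently as data volumes and cluster sizes grow.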

2. Handling real-time video analytics in a distributed cloud:

With increased access to the internet, even in developing countries, video has become a common medium of data exchange. Telecom infrastructure, operators, deployments of the Internet of Things (IoT), and CCTVs all play a role in this regard. Can the existing systems be enhanced for lower latency and higher accuracy? Once real-time video data is available, the questions are how the data can be transferred to the cloud and how it can be processed efficiently, both at the edge and in a distributed cloud.
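
As one illustration of the edge-side half of the problem, here is a hedged OpenCV sketch that samples and compresses frames before shipping them to the cloud; the stream URL and the send_to_cloud() helper are hypothetical:

```python
# Edge-side frame sampling sketch with OpenCV (cv2); the RTSP URL and
# send_to_cloud() are made up for illustration.
import cv2

def send_to_cloud(frame_bytes):
    # Placeholder: in practice an HTTPS POST or a message queue would go here.
    pass

cap = cv2.VideoCapture("rtsp://camera.local/stream")  # hypothetical source
frame_idx, sample_every = 0, 30  # e.g., ~1 frame/second at 30 fps

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % sample_every == 0:
        ok, jpg = cv2.imencode(".jpg", frame)  # compress at the edge
        if ok:
            send_to_cloud(jpg.tobytes())
    frame_idx += 1
cap.release()
```

The open research questions sit around this sketch: how much analysis to do at the edge versus in the cloud, and how to keep latency and accuracy acceptable at scale.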

3. Efficient graph processing at scale:

Social media analytics is one area that demands efficient graph processing. The role of graph databases in big data analytics is covered extensively in the reference article [4]. Handling efficient graph processing at large scale is still a fascinating problem to work on.
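
For intuition, here is a small-scale sketch of an iterative graph computation (PageRank) with networkx; at true big data scale the same pattern would run on a distributed engine such as Spark GraphX or a Pregel-style system, which is where the research challenge lies. The edges below are made up:

```python
# Small-scale PageRank illustration with networkx; edges are toy examples.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])

# PageRank is a classic iterative graph computation that is hard to
# parallelize efficiently at scale because of irregular data access.
scores = nx.pagerank(G, alpha=0.85)
print(scores)
```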

The research problems related to handling noise and uncertainty in the data:

4. Identify fake news in near real-time:

Handling fake news in real-time and at scale is a very pressing issue, as fake news spreads like a virus in a bursty way. The data may come from Twitter, fake URLs, or WhatsApp. Sometimes a source may look authentic but still be fake, which makes the problem more interesting to solve.
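
A deliberately simple supervised baseline is sketched below with scikit-learn (TF-IDF features plus logistic regression); the texts and labels are toy placeholders, and closing the gap from this offline baseline to near real-time detection at scale is the research problem:

```python
# Toy fake-news classifier baseline; texts and labels are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["official statement released today", "shocking miracle cure revealed"]
labels = [0, 1]  # 0 = genuine, 1 = fake (toy labels)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["miracle cure shocks doctors"]))
```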

5. Dimensionality reduction approaches for large-scale data:

One can extend existing dimensionality reduction approaches to handle large-scale data or propose new approaches. This also includes visualization aspects. One can start with existing open-source contributions and contribute back to the open-source community.
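
As a starting point, here is a hedged sketch with scikit-learn's IncrementalPCA, which fits in mini-batches and therefore scales to data that does not fit in memory; the random batches stand in for a real large-scale dataset:

```python
# Out-of-core dimensionality reduction sketch; random data is a stand-in.
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
ipca = IncrementalPCA(n_components=10)

# Stream the data in chunks instead of loading it all at once.
for _ in range(100):
    batch = rng.normal(size=(1000, 50))  # placeholder batch of features
    ipca.partial_fit(batch)

reduced = ipca.transform(rng.normal(size=(5, 50)))
print(reduced.shape)  # (5, 10)
```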

6. Training / inference in noisy environments and with incomplete data:

Sometimes one may not get the complete distribution of the input data, or data may be lost due to a noisy environment. Can the data be augmented in a meaningful way by oversampling, the Synthetic Minority Oversampling Technique (SMOTE), or Generative Adversarial Networks (GANs)? Can augmentation help improve performance? How one can train and infer under these conditions is the challenge to be addressed.
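
Here is a minimal sketch of one of the augmentation options mentioned above, SMOTE, using the imbalanced-learn package (assumed installed); the dataset is synthetic:

```python
# SMOTE oversampling sketch on a synthetic imbalanced dataset.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))  # minority class oversampled to balance
```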

7. Handling uncertainty in big data processing:

There are multiple ways to handle uncertainty in big data processing [4]. This includes sub-topics such as how to learn from low-veracity, incomplete, or imprecise training data, and how to handle uncertainty with unlabeled data when the volume is high. Active learning, distributed learning, deep learning, and fuzzy logic theory can be tried to solve these sets of problems.

The research problems in the security and privacy [5] area:

8. Anomaly Detection in Very Large Scale Systems:

Anomaly detection is a standard problem, but it is not trivial at large scale and in real time. The range of application domains includes health care, telecom, and finance.
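
For reference, a hedged single-machine sketch with scikit-learn's IsolationForest is shown below; pushing this kind of detector to very large, streaming workloads is exactly the open problem:

```python
# Single-machine anomaly detection sketch; the data is synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(1000, 2))    # placeholder "normal" traffic
outliers = rng.uniform(-6, 6, size=(10, 2))  # placeholder anomalies
X = np.vstack([normal, outliers])

clf = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = clf.predict(X)  # -1 = anomaly, 1 = normal
print((labels == -1).sum(), "points flagged as anomalous")
```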

9. Effective anonymization of sensitive fields in the large scale systems:

Let me take an example from healthcare systems. A chest X-ray image may contain PHR (Personal Health Record) data. How can one anonymize the sensitive fields to preserve privacy in a large-scale system in near real-time? This can be applied to other fields as well, primarily to preserve privacy.
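
A toy rule-based redaction sketch on free text is given below; a production system would combine NER models, DICOM tag scrubbing, and image region masking, and the patterns and record here are made up for illustration:

```python
# Toy rule-based redaction; patterns and the sample record are made up.
import re

PATTERNS = {
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),
}

def anonymize(text: str) -> str:
    # Replace each matched sensitive field with a labeled placeholder.
    for name, pattern in PATTERNS.items():
        text = pattern.sub(f"[{name.upper()}]", text)
    return text

print(anonymize("Patient MRN: 123456, contact 555-123-4567 for follow-up."))
```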

10. Secure federated learning with real-world applications:

Federated learning enables model training on decentralized data. It can be adopted where the data cannot be shared due to regulatory or privacy issues, but models still need to be built locally and then shared across boundaries. Whether we can make federated learning work at scale and secure it with standard software/hardware-level security is the next challenge to be addressed. Interested researchers can explore further information from the RISELab at UC Berkeley in this regard.
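
As a toy illustration of the core idea, here is a bare-bones federated averaging (FedAvg) sketch in numpy; the clients, their private data, and the local "training" step are all simulated stand-ins:

```python
# Bare-bones FedAvg simulation: only model weights leave each client.
import numpy as np

def local_update(weights, client_data):
    # Stand-in for real local SGD: one step toward the client's data mean.
    return weights + 0.1 * (client_data.mean(axis=0) - weights)

rng = np.random.default_rng(0)
global_weights = np.zeros(3)
clients = [rng.normal(loc=i, size=(50, 3)) for i in range(3)]  # private data

for _ in range(10):
    updates = [local_update(global_weights, data) for data in clients]
    global_weights = np.mean(updates, axis=0)  # server averages the models

print(global_weights)
```

Securing this loop (against model inversion, poisoning, and leakage through the shared weights) at production scale is the research gap.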

11. Scalable privacy preservation on big data:

Privacy preservation for large-scale data is a challenging research problem, as the range of applications varies from text and images to videos. Differences in country- and region-level privacy regulations make the problem even more challenging to handle.
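
One well-known building block in this space is the Laplace mechanism from differential privacy: add calibrated noise to an aggregate so that no single record can be inferred. The sketch below uses an illustrative epsilon and count:

```python
# Laplace mechanism sketch; epsilon and the count are illustrative only.
import numpy as np

def laplace_count(true_count: int, sensitivity: float = 1.0,
                  epsilon: float = 0.5) -> float:
    # Noise scale grows as epsilon (the privacy budget) shrinks.
    scale = sensitivity / epsilon
    return true_count + np.random.default_rng().laplace(0.0, scale)

print(laplace_count(10_000))  # a privatized count-query result
```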

The research problems related to data engineering aspects:

12. Lightweight Big Data analytics as a Service:

Offering everything as a service is a growing trend in the industry, as with Software as a Service (SaaS). Can we work towards providing lightweight big data analytics as a service?

13. Auto-conversion of algorithms to MapReduce problems:

MapReduce is a well-known programming model in big data. It is not just the map and reduce functions; it provides scalability and fault tolerance to applications. However, not many algorithms map directly onto MapReduce. Can we build a library to automatically convert standard algorithms into MapReduce form?
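
To make the target shape of such a conversion concrete, here is the canonical word-count example expressed as explicit map and reduce phases in plain Python; a conversion library would need to emit this structure automatically from an ordinary algorithm:

```python
# Word count expressed as map and reduce phases in plain Python.
from collections import defaultdict
from itertools import chain

docs = ["big data at scale", "data science and big data"]

# Map phase: emit (word, 1) pairs for every word in every document.
mapped = chain.from_iterable(((w, 1) for w in d.split()) for d in docs)

# Shuffle + reduce phase: group by key and sum the values.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))  # {'big': 2, 'data': 3, ...}
```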

14. Automated Deployment of Spark Clusters:

A lot of progress has been witnessed in the usage of Spark clusters in recent times, but they are not completely ready for automated deployment. This is yet another challenging problem to explore further.

The research problems at the intersection of big data and data science:

15. Approaches to make models learn with fewer data samples:

In the last 10 years, the complexity of deep learning models has increased with the availability of more data and compute power. Some researchers proudly claim that they solved a complex problem with a network of hundreds of layers; for instance, image segmentation may need a 100-layer network. However, the recent trend asks: can the same problem be solved with less data and lower complexity? The motivation behind this thinking is to run models on edge devices, not only in cloud environments with GPUs/TPUs. For instance, deep learning models trained on big data might need to be deployed on CCTV cameras or drones for real-time usage. This is fundamentally changing the approach to solving complex problems. You may work on challenging problems in this sub-topic.

16. Neural Machine Translation to Local languages:

One can use Google Translate for neural machine translation (NMT) activities. However, there is a lot of research in local universities on neural machine translation into local languages, with support from governments. The latest advances in Bidirectional Encoder Representations from Transformers (BERT) are changing the way these problems are solved. One can collaborate with those efforts to solve real-world problems.
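
For orientation, here is a hedged sketch using the Hugging Face transformers pipeline with a publicly available Marian English-to-Hindi checkpoint (this assumes transformers and sentencepiece are installed and the pretrained model can be downloaded):

```python
# NMT sketch with a public Marian checkpoint; requires model download.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")
print(translator("Big data research is growing rapidly.")[0]["translation_text"])
```

The research gap is in low-resource local languages where such pretrained checkpoints are weak or do not exist.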

17. Handling Data and Model drift for real-world applications:

Do we need to run the model on inference data if we know that the data pattern is changing and the performance of the model will drop? Can we identify drift in the data distribution even before passing the data to the model? If one can identify the drift, why pass the data to the model for inference and waste compute power? This is a compelling research problem to solve at scale in the real world. Active learning and online learning are some of the approaches to solving the model drift problem.
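
As a minimal illustration of pre-inference drift checking, the sketch below applies the two-sample Kolmogorov-Smirnov test from scipy to a single feature; the reference and incoming samples are synthetic, and a real system would monitor many features and control for multiple testing:

```python
# Univariate drift check with the two-sample KS test; data is synthetic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)  # training-time distribution
incoming = rng.normal(0.5, 1.0, size=1000)   # shifted live distribution

stat, p_value = ks_2samp(reference, incoming)
if p_value < 0.01:
    print("drift detected: skip inference or trigger retraining", stat)
else:
    print("no significant drift", stat)
```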

18. Handling interpretability of deep learning models in real-time applications:

Explainable AI is a recent buzzword. Interpretability is a subset of explainability.

Machine and deep learning models can no longer remain black boxes. A few models, such as decision trees, are inherently interpretable. However, as complexity increases, the base model itself may not be able to explain its results, and we may need to depend on surrogate models such as Local Interpretable Model-agnostic Explanations (LIME) or SHapley Additive exPlanations (SHAP). These can give decision-makers a justification for the results produced, for instance, the rejection of a loan application or the classification of a chest X-ray as COVID-19 positive. Can interpretable models handle large-scale real-time applications?
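
A minimal offline sketch with the shap package (assumed installed) is given below, explaining a tree ensemble on scikit-learn's toy iris data; doing this at real-time, large-scale latencies is the open question:

```python
# Offline SHAP explanation sketch on toy data.
import shap
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])  # per-feature attribution scores
# Exact output shape varies by shap version (list per class vs. 3-D array).
```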

19. Building context-sensitive large scale systems:

Building large-scale context-sensitive systems is the latest trend. There are some open-source efforts to kick-start from. However, it requires a lot of effort to collect the right set of data and build context-sensitive systems that improve search capability. One can choose a research problem in this topic if one has a background in search, knowledge graphs, and Natural Language Processing (NLP). This is applicable across domains.

20. Building large-scale generative conversational systems (chatbot frameworks):

One specific area gaining momentum is building conversational systems, such as Q&A and generative chatbot systems. Many chatbot frameworks are available; making them generative and summarizing conversations in real time are still challenging problems, and the complexity increases with scale. A lot of research is going on in this area. It requires a good understanding of Natural Language Processing and of the latest advances, such as BERT, to expand the scope of what conversational systems can solve at scale.

Research Methodology:

I hope you can frame specific problems from the topics highlighted above using your domain and technical expertise. Let me recommend a methodology for solving any of these problems. Some points may look obvious to researchers; however, let me cover them in the interest of the larger audience:

Identify your core strengths, whether in theory, implementation, tools, security, or a specific domain; other skills you can acquire while doing the research. Identifying the right research problem with suitable data gets you roughly 50% of the way to the milestone. The problem may overlap with other technology areas such as the Internet of Things (IoT), Artificial Intelligence (AI), and cloud computing. Your passion for research will determine how far you can go in solving the problem. The trend is toward interdisciplinary research problems across departments, so one may choose a specific domain in which to apply big data and data science skills.

Literature survey: I strongly recommend following only authenticated publications such as IEEE, ACM, Springer, Elsevier, and ScienceDirect. Do not fall into the trap of "International journal …" venues that publish without peer review. At the same time, do not limit the literature survey to IEEE/ACM papers only; a lot of interesting papers are available on arxiv.org and Papers with Code. One needs to check and follow the top research labs in industry and academia for the shortlisted topic. That gives the latest research updates and helps identify the gaps to fill.

Lab ecosystem: Create a good lab environment to carry out strong research. This can be your research lab with professors, post-docs, Ph.D. scholars, and masters and bachelor students in an academic setup, or with senior and junior researchers in an industry setup. Having the right partnerships is key to collaboration, and you may try virtual groups as well. A good ecosystem boosts results, as members can challenge each other's approaches and improve the results further.

Publish at the right avenues: As mentioned in the literature survey, publish the research papers in the right forums, where you will receive peer reviews from experts around the world. You may encounter obstacles in this process in the form of rejections. However, as long as you receive constructive feedback, you should be thankful to the anonymous reviewers. You may also see a potential opportunity to patent the ideas if the approach is novel, non-obvious, and inventive. The recent trend is to open-source the code while publishing the paper; if your institution permits, you may do so by uploading the relevant code to GitHub with appropriate licensing terms and conditions.
