Big Data as we know it has been around for a little over a decade, and its definition and applications have undergone major changes during that time. However, as technology continues to evolve at a rapid pace, it’s worth asking the question: is the era of Big Data coming to an end?
First of all, what is Big Data? In its simplest definition, Big Data refers to the massive amounts of information generated by sources such as social media and IoT devices, which can be analysed to gain insights and make better-informed decisions. The primary challenge of Big Data has always been how to process and analyse such a large volume and variety of information in a timely and cost-effective manner.
Now that we’ve defined Big Data, our question is: could the end of the Big Data frenzy be in sight, at least in the business practice of Data Science and Machine Learning (ML)?
Many data scientists have come to the conclusion that what matters is not data quantity but data quality: a few tens of thousands of good-quality samples are more valuable for most (if not all) ML algorithms than millions or billions of records containing duplicate samples, incorrect information, imbalanced targets and missing values.
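The quality problems listed above (duplicates, missing values, imbalanced targets) can be measured before any training starts. The following is a minimal sketch using only the Python standard library; the function name `quality_report`, the field names and the toy records are illustrative assumptions, not part of any particular toolkit.

```python
from collections import Counter


def quality_report(records, target_key):
    """Count duplicates, incomplete rows and target balance in a list of dicts."""
    # A row is a duplicate when every key/value pair has been seen before.
    fingerprints = Counter(tuple(sorted(r.items())) for r in records)
    duplicate_rows = sum(count - 1 for count in fingerprints.values())

    # A row is incomplete when any of its fields is None.
    rows_with_missing = sum(
        1 for r in records if any(v is None for v in r.values())
    )

    # Ratio between the most and least frequent target value (1.0 = balanced).
    targets = Counter(
        r[target_key] for r in records if r.get(target_key) is not None
    )
    imbalance_ratio = (
        max(targets.values()) / min(targets.values()) if targets else None
    )

    return {
        "rows": len(records),
        "duplicate_rows": duplicate_rows,
        "rows_with_missing": rows_with_missing,
        "target_counts": dict(targets),
        "imbalance_ratio": imbalance_ratio,
    }


# Hypothetical toy data: one exact duplicate, one missing value, 4:1 imbalance.
records = [
    {"age": 34, "income": 52000, "churned": 0},
    {"age": 34, "income": 52000, "churned": 0},
    {"age": 51, "income": None, "churned": 0},
    {"age": 27, "income": 31000, "churned": 1},
    {"age": 45, "income": 78000, "churned": 0},
]
report = quality_report(records, "churned")
print(report)
```

A report like this does not detect incorrect values, which usually need domain-specific rules, but it gives a quick first indication of whether a dataset is worth training on as it stands.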
Big Data concepts might still be valuable in areas such as BI, data analytics, insights or data quality assessments. However, for pure ML development, large data volumes can be seen as a burden in today’s landscape: training costs are higher, AutoML pipelines and in-memory processes become unworkable, models are typically larger to store, and bigger datasets have to be maintained and explored during EDA.
In Data Science, ML and MLOps, the default investment should be in a solid data engineering process that produces a concise, sub-sampled dataset of high-quality examples representing the problem at hand, rather than in working at scale with all of the “information” simply transformed or extracted out of the raw data.
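One way such a process could look in code is sketched below: drop incomplete rows, remove exact duplicates, then draw a fixed number of examples per target class. This is a simplified illustration under stated assumptions; the function name `curate`, the `per_class` parameter and the synthetic data are hypothetical, and sampling an equal number per class is only one possible strategy.

```python
import random
from collections import Counter, defaultdict


def curate(records, target_key, per_class, seed=0):
    """Return a deduplicated, complete, class-balanced sub-sample of records."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Keep the first occurrence of each complete row; skip rows with None.
    unique = {}
    for r in records:
        if any(v is None for v in r.values()):
            continue
        unique.setdefault(tuple(sorted(r.items())), r)

    # Group the surviving rows by target value.
    by_class = defaultdict(list)
    for r in unique.values():
        by_class[r[target_key]].append(r)

    # Draw at most `per_class` rows from every class, without replacement.
    sample = []
    for rows in by_class.values():
        sample.extend(rng.sample(rows, min(per_class, len(rows))))
    return sample


# Hypothetical raw data: 1,000 majority rows (only 200 distinct),
# 50 minority rows, and one row with a missing feature.
raw = [{"x": i % 200, "label": 0} for i in range(1000)]
raw += [{"x": i, "label": 1} for i in range(50)]
raw += [{"x": None, "label": 1}]

curated = curate(raw, "label", per_class=40)
label_counts = Counter(r["label"] for r in curated)
print(len(raw), "->", len(curated), dict(label_counts))
```

Balancing the classes this way changes the class proportions seen during training, so whether it is appropriate depends on the task; the point is only that a small, clean sample can be produced with a few explicit, auditable steps.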
While the end of Big Data as we know it is not certain, several factors could significantly change how we process and analyse data in the near future. Examples include deciding how to balance data quality against quantity, and exploring the scenarios in which a larger data volume is genuinely valuable for a specific ML task. Regardless, it’s important that you leverage the power of data to drive your organisation forward.