Benefits of Big Data – and a few challenges
The benefits of Big Data are many, yet there seems to be much confusion and even fear about the topic. Therefore, before continuing with this article, it seems prudent to first define the term. Big Data is a computing term, defined, according to Google, as “extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions”. In plainer terms, Big Data is all the information a company collects, using today’s computer and telecom technologies, about the people visiting, interacting with and using its computerized systems.
This is no longer limited to people who use a conventional computer, such as a desktop, laptop or tablet; it also includes those who use smartphones, drive cars, use ATMs, stoves, coffee-makers and even electricity, thanks to the advent of chip technology (sensors) in virtually every appliance, smart meters on most homes… and even chips in human bodies. Data production will be 44 times greater in 2020 than it was in 2009 (source). In other words, Big Brother IS watching, no matter where you go, unless you happen to be completely off the grid. But ultimately, for businesses, and in fact for the world itself, Big Data presents some real benefits. More data allows us to see new things, see things better, see things differently, and see more things than we ever saw before. That, in turn, presents opportunities and benefits that we never had before.
Big Data provides insight into our customers, contacts and businesses that we never had before, often at a fraction of the cost of what we used to pay for just a fraction of the data. We can now see who visits our sites, what they do there, and in some cases even what they’re looking at on each page… imagine the possibilities! This is what made the French startup Criteo so successful: its ability to predict WHO will probably buy WHAT on your e-commerce site. A proper analysis of all this data can help set the direction of marketing campaigns and guide which products and/or services to add or remove. And, and this is where it gets really exciting in the 21st century, it can help us develop brand new products that never existed before, because Big Data tells us, before we even develop them, that these are things people want and will buy.
But Big Data can do more than this. The possibilities that are opening up are astonishing… There has been a progression in so-called Big Data as technology has evolved. It began with employees entering data by hand. Then users began to enter their own, such as when they joined a social site (see how it just scaled WAY up?), and for companies such as Google, Facebook, Amazon and eBay, along with a host of other medium to large companies, the amount of data collected became very big. Now that machines collect the data by themselves, such as Google’s web crawlers, your new car, smart meters, and others, the amount of data has become huge: far greater than humans can digest, and indeed far greater than a single machine can process.
The “3 Vs” of Big Data are volume, velocity and variety (IBM later added a fourth V: veracity). They are closely connected: when one of them grows, the other two tend to grow as well. As a result, many systems simply ignore most of their data, because the technology cannot keep up. In a nutshell, the hardware can’t process, or make sense of, all the data being thrown at it, so it purges most of it. What a waste! One solution has been to use distributed databases and distributed file systems, in which multiple machines are used, often in multiple locations. Wikipedia defines a distributed database as a “database in which storage devices are not all attached to a common processing unit such as the CPU, and which is controlled by a distributed database management system (together sometimes called a distributed database system)”. Such systems bring their own challenges. The CAP theorem, formulated by Eric Brewer, states that any such distributed database or shared-data system can guarantee at most two of the following three desirable properties at any one time:
- C – consistency: every read sees a single, up-to-date copy of the data
- A – availability: every request receives a response, so the data remains accessible for reads and updates
- P – partition tolerance: the system keeps working even when network failures split it into partitions that cannot communicate
As you can imagine, when any one of these three is missing, it can be a real problem. Spreading a system across multiple conventional machines makes network partitions a fact of life, so a trade-off between consistency and availability must be made. When the volume is high (remember the 3 Vs), serious lag can occur as the system tries to process everything. Remember having to wait so long for Facebook to open? (By the way, Facebook’s way of dealing with this was to cache pages and thereby serve up “old” data, which is why you have to refresh the page after you make a post in order to see it. It is a patch, but it works reasonably well, most of the time.)
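The caching idea described above can be sketched in a few lines of Python. This is a minimal illustration, not Facebook’s actual implementation: the class and method names are invented for the example. The point is the trade-off itself: a read within the time-to-live window returns a possibly stale cached value instead of consulting the slow backing store, favoring availability and speed over strict consistency.

```python
import time

class TTLCache:
    """Minimal time-to-live cache: serves possibly stale data
    instead of hitting the slow backing store on every read."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, timestamp)

    def get(self, key, fetch):
        """Return the cached value if still fresh; otherwise call
        fetch() to load it from the backing store and re-cache it."""
        entry = self._store.get(key)
        now = time.monotonic()
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]          # fast path: possibly stale data
        value = fetch()              # slow path: real lookup
        self._store[key] = (value, now)
        return value

    def invalidate(self, key):
        """Force a refresh, e.g. after the user posts new content."""
        self._store.pop(key, None)
```

Refreshing the page after a post corresponds to calling `invalidate` so the next read goes to the backing store again.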
With Cisco estimating more than 4.8 zettabytes, or 4.8 billion terabytes, of data on the Internet by the end of 2015, there is simply a massive amount of data available for processing, perhaps so much that the term “Big Data” is now inadequate and it should be called “Huge Data”. If the CAP theorem holds (it does… see what its author says in CAP Twelve Years Later: How the “Rules” Have Changed), then the more data we collect, and the faster we collect it, the greater the challenge of using it all without stalling or completely crashing systems. Enter Apache™ Hadoop®, an open source project for developing software for “reliable, scalable, distributed computing,” and probably the most widely used system, by companies both large and small, for dealing with Big Data. Obviously, if there’s a lot of data (the 3 Vs, again), one server isn’t going to be able to handle all of it. But as soon as you start adding machines, you run into the limitations described by the CAP theorem. Furthermore, the human brain simply isn’t capable of making sense of all the data in its raw form anyway. This is where MapReduce, originally from Google but now generalized, comes into play: it processes large data sets in parallel and then condenses the information for later searching and recall. The condensed or “reduced” information can be scanned quickly, and the full data sets pulled up only when they are relevant. This is the direction modern technology is taking us, with machines doing vast amounts of work for us that would otherwise never be done. Another change that came with the growing volume of data was the emergence of alternative NoSQL databases (see a list here).
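The map-then-condense idea behind MapReduce can be shown with the classic word-count example. This is a hedged, single-process sketch in plain Python; a real Hadoop job runs the same three phases (map, shuffle, reduce) distributed across many machines, with the framework handling the shuffle. The function names here are ours, not Hadoop’s API.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (key, value) pairs — here, (word, 1) per occurrence."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would
    when routing mapper output to reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: condense each group to a single summary value."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data is big", "data about data"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

The “reduced” result (a small dictionary of counts) is what gets scanned quickly; the raw documents are only revisited when needed.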
Relational databases (RDBMS with ACID properties: atomicity, consistency, isolation, durability) use a fixed schema and are great for handling large volumes of structured data, but NoSQL is far more flexible and better suited to large amounts of varied, unstructured data, otherwise known as “Big Data.”
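The schema contrast can be sketched with tools from Python’s standard library: `sqlite3` stands in for the relational side, and plain JSON documents stand in for the document-oriented NoSQL side. This is an illustration of the idea only; the table and field names are invented for the example, and a production NoSQL store would of course not be a Python list.

```python
import json
import sqlite3

# Relational: the schema must be declared up front, and every
# row must fit it. Adding a new field means altering the table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user TEXT, page TEXT, seconds REAL)")
conn.execute("INSERT INTO visits VALUES (?, ?, ?)", ("alice", "/home", 12.5))

# Document-style (NoSQL): each record is a self-describing document,
# so irregular or nested fields need no schema migration.
visits = [
    {"user": "alice", "page": "/home", "seconds": 12.5},
    {"user": "bob", "page": "/cart", "device": "phone",
     "clicks": [{"x": 10, "y": 42}]},   # extra nested fields, no ALTER TABLE
]
serialized = [json.dumps(v) for v in visits]  # stored as-is
```

The second document carries fields the first one lacks, which is exactly the kind of varied, loosely structured data the paragraph above describes.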
Cloud-based systems are increasingly becoming the reliable norm. Rather than relying on single servers, they connect hundreds or thousands of computers with specialized interconnects to create virtual supercomputers of a kind previously available only to military installations and research facilities. These systems are capable of many trillions of computations per second, allowing very large data sets to be processed and analyzed at a fraction of what it once cost in money and time, where it was even possible at all. New technologies require new roles, and one of growing importance is that of the “data scientist.” A data scientist has much the same training as a business or data analyst, but also the ability to spot relevant trends in the data and communicate those findings clearly to business and IT leaders.
Again, as our technology evolves, the benefits continue to increase too. More and more, following Hadoop’s lead, machines are doing much of the analysis for us. New technology is even “finding” ways to detect cancer that were previously unknown, by analyzing vast quantities of patient data and recognizing patterns never seen before. This has the potential not only to save money in cancer research, but also to save time and lives as cancer is detected earlier. The savings to our healthcare system in this one area alone are potentially staggering. The McKinsey Global Institute recently estimated that the US could save $300 billion per year in efficiency gains by using Big Data, and that Europe could save $149 billion in government administration costs. These are just the tip of the iceberg. Imagine the savings to businesses, long known to be more efficient than governments, when they leverage Big Data. Big Data gives business new insight into new and emerging types of data that at one time might have been missed and discarded. As machines now do the heavy lifting for us, processing the data, discovering new potential industries and streamlining old ones, we gain more freedom to use our imaginations. Some jobs will disappear, but the possibilities for new ones, in directions as yet undiscovered, are staggering.
As we move further into the 21st century and begin to see “quantum computing” become a reality, Big Data is going to explode, but so are the benefits, as we become able to analyze the multitudes of data sets that will be produced. Things once thought impossible will quickly become historical fact as new insights and ideas turn into realities. Now is the time, if you’re not already there, to get caught up, whether in business or research or both; otherwise you will soon be far behind. Learn about the Big Data that applies to you, learn how to analyze it, and reap the benefits, before your competitors do!
Jean-Christophe (Jay C) Huc
16 June 2016 – email@example.com