Wednesday, November 4, 2015

Big Data & Analytics - Predictions about the Present...



“On two occasions I have been asked,… if you put the wrong figures into the machine, will the right answers come out? I am not rightly able to apprehend the kind of confusion of ideas that could provoke such a question.”

Charles Babbage. 1864. Passages from the Life of a Philosopher (chap. 5).



About 9 months ago (~4½ generations in data science evolution), IDC released a report from their FutureScape Program titled Big Data & Analytics 2015 Predictions. It’s still 2015, & I have been doing a good amount of work with healthcare organizations, deploying contemporary (Hadoop-based) analytics & readying these groups to use them in their strategic decision making, so I thought I’d look back at these predictions & react to them from my current experience.

4½ generations old, you say… The good news about this is that actual field deployment & use of “big data” analytics has not been growing as fast as the amount of material written in general (& the hype) about this topic[1]. First, deployment & use in any productive way requires some amount of organizational change & alignment, & that slows down deployment. Second, as I have written about previously, acquisition & deployment of a technology is not the same as its adoption[2]. This is especially true in this case as adoption requires: 1) the deployment of additional, different infrastructure & user-facing technologies, & 2) some (not small) amount of change in the way an organization thinks about data, the use of information & decision-making. Finally, most organizations are conservative when it comes to this level of change, so adoption & productive use require management commitment & a senior champion to keep things moving until leverage can be demonstrated. Given all this, I think we are 2-4 years from big data analytics being adopted & in general productive use in well-resourced organizations & 5-8 years for everyone else.

The first thing to do if I’m going to comment on IDC’s predictions is to define what big data is. There are many, many attempts to do this on the net, but suffice it to say that there is no known (at least by me) consensus as to what the term means. Quite a number of definitions are couched in terms of volume of data, & most of these use 1 PB as the boundary for big data. I think it is more nuanced than that; in fact I think that at least the following considerations are relevant:
·      Volume – 1 PB is as good a number as any, but much smaller amounts of data can be productively analyzed; see material below on data volume & data quality.
·      Storage technology – Currently relational-based data warehousing is the primary enterprise systems method for the storage of large amounts of data. This technology has definite limitations in terms of volume of data & performance that will be covered later. More contemporary & effective storage technologies include: NoSQL databases, graph databases & massively parallel distributed file systems, e.g. Hadoop Distributed File System (HDFS), or sparse array technologies, e.g. Google Cloud Bigtable. These last two are most associated with big data storage.
·      Variety – The variety of the data stored in a system is also an issue. Data warehouses require a standardized data model against which all data is normalized. Data types that cannot be normalized are not stored in native form. Normalization & transformation is a complicated & very time-consuming process, as is updating the model if/when data requirements change. HDFS & Bigtable-like systems do not require a standardized data model or this type of normalization. An almost unlimited number of types of data can be stored & utilized in these systems, making them much more aligned with real-world needs.
·      Analysis technology – Many systems, regardless of the volume of data, provide SQL-based query as well as conventional BI as the primary analysis technology. Even using Turing-complete versions of SQL (e.g. T-SQL) limits the types of queries & analysis that can be done, though all HDFS-based systems allow some form of SQL-based query. There certainly are many more SQL programmers in the world than data scientists who model & query in other systems/languages. YARN (MapReduce 2) is often used for analysis with HDFS & Bigtable-based systems, although the most effective analysis is provided by model development in a language such as R, Pig, Python, etc. operating against HDFS/YARN.
·      Search – Storage of substantial amounts of information requires a search function that is integrated & well aligned with storage & analysis functions. Conventional search techniques used in relational-based data warehouses tend to lose performance as data volume increases. Applications such as Apache Solr, Elasticsearch & the search associated with HBase, Hive etc. are designed to work with massive data storage facilities.
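
To make the Variety point above a little more concrete, here is a minimal sketch in plain Python of reading un-normalized data by applying a schema only at read time. It is only an illustration of the idea, not how HDFS or any particular product works; the records, field names & the read_with_schema helper are all hypothetical.

```python
import json

# Hypothetical raw records of mixed shape, stored as-is (one JSON document per line).
# No common data model is imposed when the data is written.
RAW_LINES = [
    '{"type": "lab", "patient_id": 1, "test": "HbA1c", "value": 6.1}',
    '{"type": "visit", "patient_id": 1, "clinic": "North", "minutes": 25}',
    '{"type": "lab", "patient_id": 2, "test": "HbA1c", "value": 7.4}',
    '{"type": "note", "patient_id": 2, "text": "follow up in 3 months"}',
]


def read_with_schema(lines, record_type, fields):
    """Apply a schema at read time: keep one record type, project the requested fields."""
    for line in lines:
        record = json.loads(line)
        if record.get("type") == record_type:
            # Fields missing from a record come back as None rather than failing.
            yield {name: record.get(name) for name in fields}


# Each analysis decides which shape it needs; the stored data is never rewritten.
lab_rows = list(read_with_schema(RAW_LINES, "lab", ["patient_id", "value"]))
print(lab_rows)
```

A warehouse-style approach would instead force all four records into one agreed model before loading, which is where the normalization effort described above comes from.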

Prediction is hard, we are told. The great Yogi Berra is reported to have opined that, “Prediction is hard, especially when it’s about the future”. One of the best ways to ensure that your predictions are accurate is to make them about the present. As a technology futurist, I’ve been guilty of this myself. Many of the IDC predictions are just this. They are not wrong… they may be accurate & prescient statements about the state of the technology – they just are not predictions.

So what are the IDC predictions? I’ll list them here with associated comments:


1.     Visual data discovery tools will be growing 2.5x faster than rest of the BI market. By 2018 investing in this enabler of end user self-service will become a requirement for all enterprises.
I think this is essentially correct & already happening faster than this prediction. Many organizations already use visual front-ends for their business intelligence (BI) analysis & so their expectations are set by current practice. Unfortunately, this often means that it is difficult to look beyond current practice to find unique & productive ways of using tools that are already being used in another context. IDC points out that the adoption of visualization tools is driven by a demand for “self-service” & BYOT (tools) in analysis. While this might be true, a more compelling motivation is that often the results of complex modeling, or even complex statistical analysis, are difficult to interpret if one is not a data scientist. Tools such as Qlik & Tableau allow for quick summarization of certain types of results in an easily understandable form. One of the downsides of the use of these tools is that often results are complex or more nuanced than can be expressed in visualizations. I have found that these tools are best used to summarize results for executive review.
2.     Over the next 5 years spending on cloud-based BDA (big data & analytics) solutions will grow 3x faster than spending for on-premise solutions. Hybrid on/off premise deployments will become a requirement.
Also essentially correct, but what it says about big data & analytics is that the majority of systems used for this effort will be cloud-hosted in some way. This both reduces cost, as opposed to on-premise hosting, & reduces risk as cloud services take on the maintenance & upgrade efforts needed to make the analytic effort work.
There is a different way of looking at this, though, one that I have to deal with currently on several data projects. The problem that this solution tries to address is the provision of (perceived) adequate levels of data privacy & security. A number of companies that I’m working with have specified that no data is to leave their security perimeter &, similarly, no processing (analysis) is to be done off-site. This was fine when companies had 1 PB (1x10^15 bytes) of data but gets difficult to impossible at 500 PBs or, say, 5 EBs (5x10^18 bytes). At least one of these companies has a respectable amount of data (35-40 PBs). This company has chosen a commercial private cloud deployment for BDA (from Rackspace) that allows for self-service provisioning of servers, elastic scalability, multitenancy & many of the other advantages of cloud-based deployments in what is essentially a private environment. Luckily, they have a well-resourced & capable IT group, as this is not an easy deployment, even with a commercial vendor. The most popular current private cloud deployment is the open source OpenStack platform. There are many (many) horror stories related to these deployments (not just OpenStack, but any of the open source offerings). The technical & deployment issues with open source private cloud stacks have been written about at least since mid-2013[3]. Of course, as Mr. Asay & many others have pointed out, the real problem is the contradiction of developing & deploying a cloud infrastructure for private use. I actually think there is a place for this specific solution, but I agree that it is much more likely to be used as a hybrid solution where processing may be done publicly, so long as data & proprietary results are stored privately. Of course, the company I’m working with that has 40 PBs of data wanted everything on premise, although they did the deployment at a remote data center where they lease space, so everything is relative…
3.     Shortage of skilled staff will persist. In the U.S. alone there will be 181K deep analytics roles in 2018 and 5x that many positions requiring related skills in data management and interpretation.
I don’t know how IDC derived these numbers, but they are probably as good as any. The thing that I think has to be addressed is how we promote & advance analytics without having to wait for 181,000 doctoral-level data scientists to be trained. It’s Q4 2015; can this many people be trained by 2018? Probably not… This means that we have to “make do” with what we have. In part we are already doing this by using SQL as a means of querying HDFS data, but we need to go beyond this. SQL is (kinda) OK as a database query language[4], but it was never intended, despite the provision for stored procedures[5], as a modeling & functional execution language. Data analysts need to begin to learn to use additional languages to model & query the data in massive stores.
As of today, the primary languages in use for this are: R, Python, Java, Scala, Pig Latin & various additions to the Hadoop stack such as Hive. R is the most used analytic language at present. It is excellent for complex statistical models & analysis & has a very rich ecosystem of added functions & libraries. It is probably just on the cusp of being superseded, primarily because it has some size limitations with respect to how much data it can deal with. Python is easier to learn & use than R, but it still has some size limitations & is not highly performant at scale. Java, of course, has the advantage of a very large programmer base, but it is not very good at statistical analysis (complex functions have to be written rather than called from libraries). It does have good options for the display of statistical results. Scala runs on the Java Virtual Machine (JVM) & is mainly used today to build machine-learning algorithms. Pig Latin is run on the Pig platform (developed at Yahoo Research & now an Apache project) & is a high-level dataflow language that Pig compiles into MapReduce jobs for analysis.
Probably the easiest path here is to start with what you know: SQL, moving to Java, moving to something like Pig, to be able to write MapReduce jobs directly. Also, keeping up with what is being put into open source in this area is important, as many languages & visual palettes that allow direct MapReduce programming will be developed & released over the next 2-3 years.
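
For readers coming from SQL, it may help to see what a MapReduce job looks like in miniature. The following is a single-process sketch in plain Python of the map, shuffle & reduce phases; it does not use Hadoop, Pig or any real cluster API, & the data & function names (visits, visit_mapper, count_reducer, run_job) are made up for illustration. It computes roughly what SELECT clinic, COUNT(*) … GROUP BY clinic would.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Run the mapper over every input record, emitting (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all emitted values by key (a real cluster does this across machines)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Hypothetical input: one record per clinic visit.
visits = [{"clinic": "North"}, {"clinic": "South"}, {"clinic": "North"}]


def visit_mapper(record):
    # Emit one (clinic, 1) pair per visit.
    yield record["clinic"], 1


def count_reducer(key, values):
    # Sum the 1s emitted for this clinic.
    return sum(values)


counts = run_job(visits, visit_mapper, count_reducer)
print(counts)
```

The appeal of moving beyond SQL is that the mapper & reducer can be arbitrary functions, so the same skeleton can carry a model-scoring step instead of a simple count.
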
4.     By 2017 unified data platform architecture will become the foundation of BDA strategy. The unification will occur across information management, analysis, and search technology.
This has already happened as evidenced by the offerings of Cloudera, Hortonworks, etc. I expect that offerings of this type will continue to be more & more integrated so that ultra-large scale information management analysis & search will seem almost seamless. This will also apply to offerings based on other (Non-Hadoop) platforms such as Spark.
5.     Growth in applications incorporating advanced and predictive analytics, including machine learning, will accelerate in 2015. These apps will grow 65% faster than apps without predictive functionality.
This is also essentially already happening. Very few people actually understand the internals of machine learning, but more & more businesses are basing business models & products on it. I think that there are two major paths here. The first is that products in the form of applications & add-on modules will become very important so that machine learning in some form can be integrated into many different types of business processes. Second, these machine-learning modules will also be integrated into the analytic stacks mentioned above… Currently there are several such modules associated with both Hadoop (Apache Mahout, Vowpal Wabbit, Cloudera Oryx, 0xdata H2O, MLlib) & Spark (MLlib, Cloudera Oryx). Most of these are trend analysis &/or predictive algorithms. Actual machine learning (supervised or unsupervised) integrated with analytics stacks is still a little ways off (2-3 years).
6.     70% of large organizations already purchase external data and 100% will do so by 2019. In parallel more organizations will begin to monetize their data by selling them or providing Value Added Content.
Again, this is not a real prediction. If 70% of organizations do this today, it will be considerably before 2019 that close to 100% of them do it. The real (IMHO) question is what kinds of data are being used & for what purpose(s)? Most of the BDA projects that I have seen fall into two categories: 1) optimization of operational processes & decisions, & 2) optimization of specific knowledge-based processes & decisions such as medical diagnosis or developing predictive trends in interest rate movements. In the first case mostly internal data is used except for operational benchmarking. This requires external data for comparison & trend development. In the second case, substantial amounts of data exist outside of a typical organization that will enhance the optimization of knowledge-based processes. In healthcare, for instance, this might include clinical data from partners or State & Federal sources, population health data from partners or State & Federal sources, registry data on immunization, best practice data from public & private sources & many more. Most of this will require acquiring data from external sources, & in turn, as IDC has pointed out, this will provide substantial opportunities for organizations to monetize their data & for intermediaries to aggregate & “sell” data.
7.     Adoption of technology to continuously analyze streams of events will accelerate in 2015 as it is applied to IoT analytics – which is expected to grow at a 5-year CAGR of 30%.
The main driver for this, at least commercially, will be the internet of things (IoT), so depending on how quickly you believe the IoT will be developed, deployed & adopted at scale, this rate is either wildly too low or wildly too high. I currently do not see a huge push commercially for the IoT, so I think this rate is too high. In 5 years, I think we’ll still be looking at early adoption, especially in at-home use, except in some specific areas. These include (but, as we futurists say, are not limited to):
·      Healthcare – remote monitoring both in facility & at home will increase the number of sensors that are reporting on an individual’s health status.
·      Transportation – There are two separate areas: autonomous driving will increase the data flow as information is analyzed both in real time & asynchronously. This will also be true for larger-scale data streams for things like traffic control.
·      Process Manufacturing – This has already happened to a large extent. Industries such as chemical production have used continuous monitoring & analysis of data for years.
·      Other Manufacturing – More & more discrete manufacturing processes are designed with this type of sensing & monitoring.
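
As a sanity check on what a growth rate like this implies, here is a small Python sketch that converts a compound annual growth rate (CAGR) into a total growth multiple. The only inputs are the 30% rate & 5-year horizon quoted in the prediction; the function name growth_multiple is just an illustrative choice.

```python
def growth_multiple(cagr, years):
    """Total growth factor after compounding `cagr` (e.g. 0.30 for 30%) for `years` years."""
    return (1.0 + cagr) ** years


# IDC's IoT analytics figure: 30% CAGR over 5 years.
iot_multiple = growth_multiple(0.30, 5)
print(f"30% CAGR over 5 years is about {iot_multiple:.2f}x the starting size")
```

In other words, a 30% CAGR means the market would be a bit under four times its current size in 5 years, which is the claim I think is too high.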

8.     Decision management platforms will expand at a CAGR of 60% through 2019 in response to the need for greater consistency in decision making and decision making process knowledge retention.
This one I am not so sure about. As above, I think this will vary wildly in different segments. For generalized business & strategy decisions, I think this will continue to be a hard sell. The technology exists now to greatly facilitate & enhance planning & decision-making processes, but there are real social & cultural impediments to adoption. In my experience these are of two types: 1) unfamiliarity with the concepts & possibilities of big data analysis, & 2) organizational & individual homeostasis that translates into resistance to change, even in the face of real need for change. I started this essay by stating that prediction is hard… I believe that change is even harder. People can be educated about the concepts & function of new technology & process, they may even come to understand the technology & its possible uses & advantages, but if they are socially &/or culturally biased against change, even if covertly, adoption of the new technology is very difficult. I believe that general adoption of decision management platforms, or any other manifestation of big data & analytics, will require some early adopters to have unqualified successes that are obvious & evident. This may happen in specific segments over the next 3 or so years, at which point many more individuals & organizations in that segment will be more amenable to adoption. This pattern of adoption by segment (healthcare, financial services etc.) will ensure that general adoption beyond early adopters & risk takers will be delayed, probably in the 5-8 year range.
9.     Rich media (video, audio, image) analytics will at least triple in 2015 and emerge as the key driver for BDA technology investment.
I’m also not so sure about this one. I cannot see this tripling by the end of 2015, as the investment made in general in BDA does not support any segment of it tripling this year. I also cannot see this being the primary driver for BDA investment anytime soon. Most companies that are investing at this point are not primarily in the business of media analytics, at least not in the segments I have a view into. This may be true in the media industry, but I do not yet see a business problem &/or business model that would lead to this rate of adoption.
The other issue here is that analysis of this type of data is still primarily of its metadata. Analysis & more detailed analytics of the internals (actual content) of media data types are still research topics. It will take another 3-5 years before we have reliable & accurate algorithms to do this type of analysis. Of course, I could be wrong,… but…
10. By 2018 half of all consumers will interact with services based on cognitive computing on a regular basis.

Well, this depends on what your definition of “cognitive computing” is. TechTarget defines it as “simulation of human thought processes in a computerized model. Cognitive computing involves self-learning systems that use data mining, pattern recognition and natural language processing to mimic the way the human brain works.”[6] The definition goes on to say that the main function in these systems is machine learning. OK, I’ve been working in this area since the late 1970s & I am not convinced that such systems exist today or that they will, in fact, exist in 2018. Systems that use many of these capabilities certainly exist today. Some of those systems are even productive & interesting in specific ways. If this prediction is meant to mean that a large number of people will interact with systems that have used a supervised random tree algorithm to produce an optimized prospect list for an insurance product, then that is true today. If this prediction is meant to mean that a large number of people will interact with a system that has a general capability for interacting with & responding to a person with some underspecified request, I think that it is very unlikely in this timeframe. I have written much more extensively on this in a previous blog post.[7]

So where are we? I think many of IDC’s predictions were, in fact, peridictions[8]. The two most interesting ones (IMHO) were number 2, regarding cloud-based BDA solutions, & number 3, regarding the lack of knowledgeable people to do analysis. There are many pluses & minuses to cloud deployments, especially with respect to security & privacy requirements. The potential impediments in these deployments will be dealt with over the next few years by cloud vendors, & the use of hybrid clouds will, I believe, be the dominant deployment model 5 years from now. Skilled people are another matter. If there really is a need for close to 200K data scientists of various levels in the next 5 years, that need cannot be met. We’ll have to proceed using what we already know & what can be learned during this period, which includes the use of integrated platforms that unify data storage, analytic design, execution of analysis & visualization of results. We’ll also have to start building our analytic models with what expertise is available… SQL moving to more suitable languages & modeling capabilities. The one thing that we do know about big data & analytics is that it will become increasingly important – initially more as a way of determining what questions to ask & eventually (within 5-8 years) as a way of developing & testing strategy[9]. Another thing we know is that this type of analytics, or any type, will not provide actual answers for our strategic questions. That we’ll still have to do ourselves, for the foreseeable future, until we really develop & can interact with cognitive computing.


[1] Big Data does not even appear on the Gartner 2015 Hype Cycle, although it was entering the “Trough of Disillusionment” on the 2014 chart.
[2] The Coevolution of Organizations & Technology. 1994. MIT/LFM Working Papers; A Framework & Model for Technology Adoption in Healthcare Organizations. 2006. PostTechnical Strategist.
[3] See Matt Asay, writing for TechRepublic: http://www.techrepublic.com/article/private-clouds-very-public-failure/
[4] The author was a member, representing the Digital Equipment Corporation, of the ANSI X3H2 Standards Committee that standardized SQL the first time in 1986.
[5] That I opposed during the initial standardization & am still not a fan of today… Stored procedures are a great way to make the function of an analysis opaque & not amenable to modification or debugging.
[7] Turing Tests, Search & Current AI. http://posttechnical.blogspot.com/2015/09/turing-tests-search-current-ai.html
[8] statements about the present
[9] see my blog post: Design Thinking as Work Process & Strategy.  http://posttechnical.blogspot.com/2015/09/design-thinking-as-work-process-strategy.html