Charles Babbage. 1864. Passages from the Life of a Philosopher (chap. 5).
About 9 months ago, ~ 4½ generations in data science evolution,
IDC released a report from their FutureScape Program titled: Big Data &
Analytics 2015 Predictions. It’s still 2015, & I have been doing a good
amount of work with healthcare organizations deploying & readying these
groups to use contemporary (Hadoop-based) analytics in their strategic decision
making, so I thought I’d look back at these predictions & react to them
from my current experience.
4½ generations old, you say… The good news about this is that
actual field deployment & use of “big data” analytics has not been growing as
fast as the amount of material written in general (& the hype) about this
topic[1].
First, deployment & use in any productive way requires some amount of
organizational change & alignment & that slows down deployment. Second,
as I have written about previously, acquisition & deployment of a
technology is not the same as its adoption[2].
This is especially true in this case as adoption requires: 1) the deployment of
additional different infrastructure & user-facing technologies &, 2)
some (not small) amount of change in the way an organization thinks about data,
the use of information & decision-making. Finally, most organizations are
conservative when it comes to this level of change, so adoption &
productive use requires management commitment & a senior champion to keep
things moving until leverage can be demonstrated. Given all this, I think we
are 2-4 years from big data analytics being in general, productive use in
well-resourced organizations & 5-8 years for everyone else.
The first thing to do if I’m going to comment on IDC’s
predictions is to define what big data is. There are many, many attempts to do
this on the net, but suffice it to say that there is no consensus (at least
none known to me) as to what the term means. Quite a number of definitions are couched
in terms of volume of data, & most of these use ≥ 1PB as the boundary for big data. I think it is more
nuanced than that; in fact I think that at least the following considerations
are relevant:
· Volume – ≥ 1PB is as good a number as any, but much smaller amounts of data can be
productively analyzed; see material below on data volume & data quality.
· Storage technology – Currently relational-based
data warehousing is the primary enterprise systems method for the storage of
large amounts of data. This technology has definite limitations in terms of
volume of data & performance that will be covered later. More contemporary
& effective storage technologies include NoSQL databases, graph databases
& massive, parallel distributed file systems, e.g. Hadoop Distributed File System (HDFS), or sparse array
technologies, e.g. Google Cloud Bigtable. These last two are most associated with big data storage.
· Variety – The variety of the data stored in a
system is also an issue. Data warehouses require a standardized data model that
all data is normalized against. Data types that cannot be normalized are not
stored in native form. Normalization & transformation are complicated
& very time-consuming processes, as is updating the model if/when data requirements
change. HDFS & Bigtable-like systems do not require a standardized data
model or this type of normalization. An almost unlimited number of types of
data can be stored & utilized in these systems, making them much more
aligned with real-world needs.
· Analysis technology – Many systems, regardless
of the volume of data, provide SQL-based query as well as conventional BI as
the primary analysis technology. Even using Turing-complete versions of SQL (e.g. T-SQL) limits the types of queries
& analysis that can be done, though all HDFS-based systems allow some form
of SQL-based query. There certainly are many more SQL programmers in the world
than data scientists who model & query in other systems/languages. MapReduce 2
(running on YARN) is often used for analysis with HDFS & Bigtable-based systems,
although the most effective analysis is provided by model development in a
language such as R, Pig, Python, etc. operating against HDFS/YARN.
· Search – Storage of substantial amounts of
information requires a search function that is integrated & well aligned with
storage & analysis functions. Conventional search techniques used in
relational-based data warehouses tend to slow down as data volume
increases. Applications such as Apache Solr, Elasticsearch & the search
associated with HBase, Hive etc. are designed to work with massive data storage
facilities.
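To make the Variety point a little more concrete, here is a minimal Python sketch of the “schema on read” idea: records of different shapes are stored as-is & structure is imposed only when a question is asked. The record layout, the field names & the lab_values helper are all hypothetical & are not taken from any particular product; a real HDFS or Bigtable deployment would read from distributed storage rather than from an in-memory list.

```python
import json

# Hypothetical raw records, stored as-is (one JSON document per line).
# Note that the four records do not share a common set of fields.
RAW_LINES = [
    '{"type": "lab", "patient": "p1", "test": "hba1c", "value": 6.9}',
    '{"type": "note", "patient": "p1", "text": "Follow up in 3 months"}',
    '{"type": "lab", "patient": "p2", "test": "hba1c", "value": 5.4}',
    '{"type": "device", "patient": "p2", "heart_rate": [61, 64, 70]}',
]


def read_records(lines):
    """Parse each stored line; no shared schema is assumed."""
    for line in lines:
        yield json.loads(line)


def lab_values(records, test_name):
    """Impose structure at read time: keep only what this question needs."""
    return {
        rec["patient"]: rec["value"]
        for rec in records
        if rec.get("type") == "lab" and rec.get("test") == test_name
    }


hba1c_by_patient = lab_values(read_records(RAW_LINES), "hba1c")
print(hba1c_by_patient)
```

In this sketch, adding a new kind of record (say, an imaging report) would require no change to the stored data or to the existing query, which is the contrast with a warehouse's standardized model.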
Prediction is hard, we are told. The great Yogi Berra is
reported to have opined that, “Prediction is hard, especially when it’s about
the future”. One of the best ways to ensure that your predictions are accurate
is to make them about the present. As a technology futurist, I’ve been guilty
of this myself. Many of the IDC predictions are just this. They are not wrong…
they may be accurate & prescient statements about the state of the
technology – they just are not predictions.
So what are the IDC predictions? I’ll list them here with
associated comments:
1. Visual data discovery
tools will be growing 2.5x faster than rest of the BI market. By 2018 investing
in this enabler of end user self-service will become a requirement for all
enterprises.
I think
this is essentially correct & already happening faster than this prediction.
Many organizations already use visual front-ends for their business
intelligence (BI) analysis & so their expectations are set by current
practice. Unfortunately, this often means that it is difficult to look beyond
current practice to find unique & productive ways of using tools that are
already being used in another context. IDC points out that the adoption of
visualization tools is driven by a demand for “self-service” & BYOT (tools)
in analysis. While this might be true, a more compelling motivation is that
often the results of complex modeling, or even complex statistical analysis,
are difficult to interpret if one is not a data scientist. Tools such as Qlik
& Tableau allow for quick summarization of certain types of results in an
easily understandable form. One of the downsides of the use of these tools is
that often results are complex or more nuanced than can be expressed in
visualizations. I have found that these tools are best used to summarize
results for executive review.
2. Over the next 5 years
spending on cloud-based BDA (big data & analytics) solutions will grow 3x
faster than spending for on-premise solutions. Hybrid on/off premise
deployments will become a requirement.
Also
essentially correct, but what it says about big data & analytics is that
the majority of systems used for this effort will be cloud-hosted in some way.
This both reduces cost, as opposed to on-premise hosting, & reduces risk as
cloud services take on the maintenance & upgrade efforts needed to make the
analytic effort work.
There is a
different way of looking at this, though, one that I have to deal with
currently on several data projects. The problem that this solution tries to
address is the provision of (perceived) adequate levels of data privacy &
security. A number of companies that I’m working with have specified that no
data is to leave their security perimeter & similarly, no processing
(analysis) is to be done off-site. This was fine when companies had 1 GB (1x10^9 bytes) of data
but gets difficult to impossible at 500 GBs or, say, 5 TBs (5x10^12 bytes). At least one of these companies has a respectable amount of
data (35-40 PBs). This company has chosen a commercial private cloud deployment
for BDA (from Rackspace) that allows for self-service provisioning of servers,
elastic scalability, multitenancy & many of the advantages of cloud-based
deployments in what is essentially a private environment. Luckily, they have a
well-resourced & capable IT group as this is not an easy deployment, even
with a commercial vendor. The most popular current private cloud deployment is
an open source package from OpenStack. There are many (many) horror stories
related to these deployments (not just OpenStack, but any of the open source
vendors). The technical & deployment issues with open source private cloud stacks
have been written about at least since mid-2013[3].
Of course, as Mr Asay & many others have pointed out, the real problem is
the contradiction of developing & deploying a cloud infrastructure for
private use. I actually think there is a place for this specific solution, but
I agree that it is much more likely to be used as a hybrid solution where
processing may be done publicly, so long as data & proprietary results are
stored privately. Of course, the
company I’m working with that has 40 PBs of data wanted everything on premise,
although they did the deployment at a remote data center that they lease space
at, so everything is relative…
3. Shortage of skilled
staff will persist. In the U.S. alone there will be 181K deep analytics roles
in 2018 and 5x that many positions requiring related skills in data management
and interpretation.
I don’t
know how IDC derived these numbers, but they are probably as good as any. The
thing that I think has to be addressed is how do we promote & advance
analytics without having to wait for 181,000 doctoral level data scientists to
be trained. It’s Q4 2015, can this many people be trained by 2018? Probably
not… This means that we have to “make do” with what we have. In part we are
already doing this by using SQL as a means of querying HDFS data, but we need
to go beyond this. SQL is (kinda) OK as a database query language[4],
but it was never intended, despite the provision for stored procedures[5],
as a modeling & functional execution language. Data analysts need to begin
to learn to use additional languages to model & query the data in massive
stores.
As of
today, the primary languages in use for this are: R, Python, Java, Scala, Pig
Latin & various additions to the Hadoop stack such as Hive. R is the most
used analytic language at present. It is excellent for complex statistical
models & analysis & has a very rich ecosystem of added functions &
libraries. It is probably just on the cusp of being superseded, primarily
because it has some size limitations with respect to how much data it can deal
with. Python is easier to learn & use than R, but it still has some size
limitations & is not highly performant at scale. Java, of course, has the
advantage of a very large programmer base, but it is not very good at
statistical analysis (complex functions have to be written rather than called
from libraries). It does have good options for the display of statistical
results. Scala runs on the Java Virtual Machine (JVM) & is mainly used today to build machine-learning
algorithms. Pig Latin is run on the Pig platform (developed at Yahoo Research
& now an Apache project) & is a high-level data-flow language
whose scripts are compiled into MapReduce jobs for analysis.
Probably
the easiest path here is to start with what you know: SQL, moving to Java, then moving
to something like Pig to be able to write MapReduce jobs directly. Also,
keeping up with what is being put into open source in this area is important as
many languages & visual palettes that allow direct MapReduce programming
will be developed & released over the next 2-3 years.
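For readers coming from SQL, a sketch may help show what “writing MapReduce jobs” involves. The following pure-Python example imitates the map, shuffle & reduce phases on a single machine. It is an illustration of the pattern only, not Hadoop's API, & the records & function names are made up for the example. It computes roughly what SELECT code, COUNT(*) ... GROUP BY code would compute in SQL.

```python
from collections import defaultdict

# Hypothetical input records: (patient_id, diagnosis_code).
RECORDS = [
    ("p1", "E11"), ("p2", "I10"), ("p3", "E11"),
    ("p4", "J45"), ("p5", "I10"), ("p6", "E11"),
]


def map_phase(record):
    """Emit one (key, 1) pair per record, as a mapper would."""
    _patient_id, code = record
    yield code, 1


def shuffle(pairs):
    """Group emitted values by key (done by the framework in a real job)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


mapped = (pair for record in RECORDS for pair in map_phase(record))
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster the map & reduce functions run in parallel on many machines & the shuffle moves data between them; languages such as Pig Latin exist so that the analyst does not have to write these phases by hand.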
4. By 2017 unified data
platform architecture will become the foundation of BDA strategy. The
unification will occur across information management, analysis, and search
technology.
This has
already happened as evidenced by the offerings of Cloudera, Hortonworks, etc. I
expect that offerings of this type will continue to be more & more
integrated so that ultra-large scale information management, analysis &
search will seem almost seamless. This will also apply to offerings based on
other (Non-Hadoop) platforms such as Spark.
5. Growth in applications
incorporating advanced and predictive analytics, including machine learning,
will accelerate in 2015. These apps will grow 65% faster than apps without
predictive functionality.
This is
also essentially already happening. Very few people actually understand the
internals of machine learning, but more & more businesses are basing
business models & products on it. I think that there are two major paths
here. The first is that products in the form of applications & add-on
modules will become very important so that machine learning in some form can be
integrated into many different types of business processes. Second, these machine-learning
modules will also be integrated into the analytic stacks mentioned above…
Currently there are several such modules associated with both Hadoop (Apache
Mahout, Vowpal Wabbit, Cloudera Oryx, 0xdata H2O, MLlib) & Spark (MLlib, Cloudera Oryx). Most of these are trend
analysis &/or predictive algorithms. Actual machine learning (supervised or
unsupervised) integrated with analytics stacks is still a little ways off (2-3
years).
6. 70% of large organizations already purchase external data and
100% will do so by 2019. In parallel more organizations will begin to monetize
their data by selling them or providing Value Added Content.
Again, this is not a real prediction. If
70% of organizations do this today, it will be considerably before 2019 that
close to 100% of them do it. The real (IMHO) question is what kinds of data are
being used & for what purpose(s)? Most of the BDA projects that I have seen
fall into two categories: 1) optimization of operational processes &
decisions, & 2) optimization of specific knowledge-based processes &
decisions such as medical diagnosis or developing predictive trends in interest
rate movements. In the first case mostly internal data is used except for
operational benchmarking. This requires external data for comparison &
trend development. In the second case, substantial amounts of data exist
outside of a typical organization that will enhance the optimization of
knowledge-based processes. In healthcare, for instance, this might include
clinical data from partners or State & Federal sources, population health
data from partners or State & Federal sources, registry data on
immunization, best practice data from public & private sources & many
more. Most of this will require acquiring data from external sources, & in
turn, as IDC has pointed out, this will provide substantial opportunities for
organizations to monetize their data & for intermediaries to aggregate
& “sell” data.
7. Adoption of technology to continuously analyze streams of
events will accelerate in 2015 as it is applied to IoT analytics – which is
expected to grow at a 5-year CAGR of 30%.
The main driver for this, at least
commercially, will be the internet of things (IoT), so depending on how quickly
you believe the IoT will be developed, deployed & adopted at scale, this
rate is either wildly too low or wildly too high. I currently do not see a huge
push commercially for the IoT, so I think this rate is too high. In 5 years, I
think we’ll still be looking at early adoption, especially in at-home use,
except in some specific areas. These include (but, as we futurists say, are not
limited to):
· Healthcare – remote monitoring both in facility & at home
will increase the number of sensors that are reporting on an individual’s
health status.
· Transportation – There are two separate areas here: autonomous driving
will increase the data flow as information is analyzed both in real time &
asynchronously, & the same will be true for larger-scale data streams for things
like traffic control.
· Process Manufacturing – This has already happened to a large
extent. Industries such as chemical production have used continuous monitoring
& analysis of data for years.
· Other Manufacturing – More & more discrete manufacturing
processes are designed with this type of sensing & monitoring.
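As a rough check on what a 30% CAGR implies, the growth multiple after n years is (1 + rate)^n. The short Python sketch below works this out for the 5-year figure in the prediction. The function name is mine & the starting value is left abstract, since IDC's base figure is not given here.

```python
def growth_multiple(cagr, years):
    """Return the total growth multiple implied by a compound annual rate."""
    return (1 + cagr) ** years


# 30% per year, compounded over 5 years (the figures from prediction 7).
iot_multiple = growth_multiple(0.30, 5)
print(round(iot_multiple, 2))  # 3.71
```

So a 30% CAGR sustained over 5 years means a market a bit less than four times its starting size, which is the scale of growth I am skeptical of above.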
8. Decision management platforms will expand at a CAGR of 60%
through 2019 in response to the need for greater consistency in decision making
and decision making process knowledge retention.
This one I am not so sure about. As above,
I think this will vary wildly in different segments. For generalized business
& strategy decisions, I think this will continue to be a hard sell. The
technology exists now to greatly facilitate & enhance planning &
decision-making processes, but there are real social & cultural impediments
to adoption. In my experience these are of two types: 1) unfamiliarity with the
concepts & possibilities of big data analysis, & 2) organizational
& individual homeostasis that translates into resistance to change, even in
the face of real need for change. I started this essay by stating that
prediction is hard… I believe that change is even harder. People can be
educated about the concepts & function of new technology & process,
they may even come to understand the technology & its possible uses &
advantages, but if they are socially &/or culturally biased against change,
even if covertly, adoption of the new technology is very difficult. I believe
that general adoption of decision management platforms, or any other manifestation
of big data & analytics, will require some early adopters to have
unqualified successes that are obvious & evident. This may happen in
specific segments over the next 3 or so years, at which point many more
individuals & organizations, in that segment, will be more amenable to
adoption. This pattern of adoption by segment (healthcare, financial services
etc.) will ensure that general adoption beyond early adopters & risk takers
will be delayed, probably in the 5-8 year range.
9. Rich media (video, audio, image) analytics will at least
triple in 2015 and emerge as the key driver for BDA technology investment.
I’m also not so sure about this one. I
cannot see this tripling by the end of 2015, as the investment made in general
in BDA does not support any segment of it tripling this year. I also cannot see
this being the primary driver for BDA investment anytime soon. Most companies
that are investing at this point are not primarily in the business of media
analytics, at least not in the segments I have a view into. This may be true in
the media industry, but I do not yet see a business problem &/or business
model that would lead to this rate of adoption.
The other issue here is that analysis of
this type of data is still primarily of its metadata. Analysis & more
detailed analytics of the internals (actual content) of media data types are
still research topics. It will take another 3-5 years before we have reliable
& accurate algorithms to do this type of analysis. Of course, I could be
wrong,… but…
10. By 2018
half of all consumers will interact with services based on cognitive computing
on a regular basis.
Well, this depends on what your definition of “cognitive
computing” is. TechTarget defines it as “…simulation of human thought processes
in a computerized model. Cognitive computing involves self-learning systems
that use data mining, pattern recognition and natural language processing to
mimic the way the human brain works.”[6]
The definition goes on to say that the main function in these
systems is machine learning. OK, I’ve
been working in this area since the late 1970s & I am not convinced that
such systems exist today or that they will, in fact, exist in 2018. Systems
that use many of these capabilities certainly exist today. Some of those
systems are even productive & interesting in specific ways. If this
prediction is meant to mean that a large number of people will interact with
systems that have used a supervised random tree algorithm to produce an
optimized prospect list for an insurance product, then that is true today. If
this prediction is meant to mean that a large number of people will interact
with a system that has a general capability for interacting with &
responding to a person with some underspecified request, I think that it is
very unlikely in this timeframe. I have written much more extensively on this
in a previous blog post.[7]
So where are we? I think many of IDC’s predictions were, in
fact, peridictions[8].
The two most interesting ones (IMHO) were number 2 regarding cloud-based BDA
solutions & number 3 regarding lack of knowledgeable people to do analysis.
There are many pluses & minuses to cloud deployments, especially with
respect to security & privacy requirements. The potential impediments in
these deployments will be dealt with over the next few years by cloud vendors
& the use of hybrid clouds will, I believe, be the dominant deployment
model 5 years from now. Skilled people are another matter. If there really is a
need for close to 200K data scientists of various levels in the next 5 years,
that need cannot be met. We’ll have to proceed using what we already know &
what can be learned during this period, which includes: use of integrated
platforms that unify data storage, analytic design, execution of analysis &
visualization of results. We’ll also have to start building our analytic models
with what expertise is available… SQL moving to more suitable languages &
modeling capabilities. The one thing that we do know about big data &
analytics is that it will become increasingly important – initially more as a
way of determining what questions to ask & eventually (within 5-8 years) as
a way of developing & testing strategy[9].
Another thing we know is that this type of analytics, or any type, will not
provide actual answers for our strategic questions. That we’ll still have to do
ourselves, for the foreseeable future, until we really develop & can
interact with cognitive computing.
[1] Big Data does not even
appear on the Gartner 2015 Hype Cycle, although it was entering the “Trough of
Disillusionment” on the 2014 chart.
[2] The Coevolution of
Organizations & Technology. 1994. MIT/LFM Working Papers; A Framework
& Model for Technology Adoption in Healthcare Organizations. 2006.
PostTechnical Strategist.
[3] cf. Matt Asay writing for TechRepublic,
http://www.techrepublic.com/article/private-clouds-very-public-failure/
[4] The author was a member,
representing the Digital Equipment Corporation, of the ANSI X3H2 Standards
Committee that standardized SQL the first time in 1986.
[5] That I opposed during the
initial standardization & am still not a fan of today… Stored procedures
are a great way to make the function of an analysis opaque & not amenable
to modification or debugging.
[7] Turing Tests, Search &
Current AI. http://posttechnical.blogspot.com/2015/09/turing-tests-search-current-ai.html
[8] statements about the present
[9] see my blog post: Design Thinking
as Work Process & Strategy.
http://posttechnical.blogspot.com/2015/09/design-thinking-as-work-process-strategy.html