We’re entering a new world where data may be more
important than software.
Tim O’Reilly
The last five years have seen a huge amount of change in the
way healthcare is delivered in this country & in the information systems
that broadly support this delivery. Much, but not all, of this change has been
brought about by the American Recovery & Reinvestment Act (2/2009, Pub.L.
111-5, ARRA) & the Patient Protection & Affordable Care Act (3/2010,
Pub.L. 111-148, ACA), as well as changes to the Centers for Medicare &
Medicaid Services (CMS) Physician Fee Schedule (PFS 2012-2016).
These laws & regulations, along with others, have
created an environment in the healthcare space in which providers &
healthcare organizations have had to respond to:
- Requirements for the Meaningful Use of healthcare information technology (HIT) in order to receive enhanced payments from CMS
- Participation in new types of organizations including Healthcare Information Exchanges (HIEs) & Accountable Care Organizations (ACOs)
- Transition to new forms of reimbursement as the system moves from a fee-for-service basis to value- & performance-based payment.
These represent huge structural & functional changes in
the healthcare system, & they are consequently creating new issues &
requirements as the system evolves. These include:
- Need for increased connectedness among providers & healthcare organizations
- New organizational models such as HIEs & ACOs require higher levels of data sharing to improve care. This in turn requires the ability to share process & workflow across organizational & temporal boundaries
- There are currently about 280 HIEs in the U.S. These require substantial network connectivity & a shared data architecture in order to be able to do anything more than “fax PDFs” among participants.
- There are currently about 450 ACOs in the U.S. These need to be able to share demographic (identity) & cost/patient data in order to document shared risk criteria
- eReferral networks provide capability that is otherwise still supported by fax machines. They mostly use DIRECT technology for data sharing, but have also begun to offer shared clinical decision support
- All of these organizational models provide the opportunity for shared process, including: clinical decision support, analytics, common & shared workflow for diagnosis & treatment, & shared processes for operational coordination
- Need to access, manage & utilize much larger amounts of data as well as many more different types of data
- Most Community Health Centers (CHCs) have 150-250 GBs of EHR & PM data (usually representing 3-5 data years)
- Most Primary Care Associations (PCAs) have 5-10 TBs of data
- Most small clinics & hospitals are similar
- These values will more than double in the next 3-5 years
- Data sources will include clinical & financial data from EHR, PM, HIE & ACO, & eReferral systems; State & Federal public & population health sources; public demographic & macroeconomic trend data; etc.
- Some organizations already have ultra-large amounts of data
- Kaiser is said to have 9.5M patient records for 15 years (~40-50 PBs)
- Many healthcare organizations, at all levels, are starting to create data extracts or data warehouses using existing (relational database) technology
- Increased patient complexity & associated treatment complexity[1]
- 29% of the U.S. adult population has hypertension, with an average of 3 comorbidities: arthritis, high cholesterol, heart disease & diabetes
- 9%-10% of the U.S. adult population has diabetes, with an average of 3.3 comorbidities including obesity, hypertension, high cholesterol, heart disease & arthritis
- A new study finds that 75% of U.S. men & 67% of U.S. women are overweight or obese[2]
- Patients with 3 comorbidities cost $6,178/year more to treat[1]
- Increased strategy & decision complexity
- Decisions, especially strategic decisions, need to be evaluated using much larger amounts of data as well as many different types of data
- Such decisions can be shown to be more complex with respect to how much cognitive effort they require[3] & the informational complexity of both their representation & their alternatives[4]
- There is some evidence to show that as decision complexity increases, people tend to use less probability-based decision support[5]
These structural & functional changes will require
healthcare organizations to evolve, & this means that their information
systems will also have to evolve. Current operational & clinical processes
(workflows) are not adequate to support this evolution. Healthcare
organizations of all types will need to move toward new care & treatment
models that emphasize both multi-organizational shared processes &
utilization of much larger amounts of data. The emphasis on data &
process sharing is especially important to provide appropriate continuity of
care & transitions of care (personnel, locational etc.). New structural
models such as HIEs & eReferral networks will require workflows that span
organizational boundaries. A provider working with data from an HIE or
eReferral partner will need a process that takes into account external data
& may need a process that allows for more direct intermixing of
workflows between partners. Working in
such multi-organizational entities will require new workflow tools that provide
for much more collaboration. These workflows have the possibility of adding
considerable complexity to patient treatment & will have to be carefully designed
to avoid this. In addition, working in ACOs will require the ability to
generate, share & use empirical cost/patient data. Many healthcare
organizations estimate this from claims data or actually use claims data as a
surrogate for cost. This is not possible in the ACO model, & the poor alignment
between cost & clinical data makes generating true cost/patient data difficult. Many systems have no
data connection (common fields or keys) between their financial system data
& their EHR data. This will have to change in order for the ACO model to
work with a broad set of healthcare organizations.
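To make the key-alignment problem concrete, here is a minimal, hypothetical sketch (in Python with pandas; the file & column names are invented) of the kind of patient-ID crosswalk an organization would have to build & maintain just to join billing charges to EHR records & produce empirical cost/patient figures:

```python
import pandas as pd

# Hypothetical extracts: an EHR patient list & financial-system charge records.
# In many real systems these share no common key, so a manually maintained
# crosswalk (EHR MRN <-> billing account number) has to bridge them.
ehr_patients = pd.read_csv("ehr_patients.csv")        # columns: mrn, name, primary_dx
charges = pd.read_csv("billing_charges.csv")          # columns: account_no, charge_amount
crosswalk = pd.read_csv("mrn_account_crosswalk.csv")  # columns: mrn, account_no

# Attach an MRN to each charge via the crosswalk, then total cost per patient.
cost_per_patient = (
    charges.merge(crosswalk, on="account_no", how="inner")
           .groupby("mrn", as_index=False)["charge_amount"].sum()
           .rename(columns={"charge_amount": "total_cost"})
)

# Join back to the EHR record so cost can be reported alongside clinical data.
report = ehr_patients.merge(cost_per_patient, on="mrn", how="left")
print(report.head())
```

Where no such crosswalk exists, even this simple join is impossible, which is exactly the gap that makes empirical cost/patient reporting so hard for ACOs.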
More than the workflow & data environments will have to
change. The systems that support these environments will also have to change.
Current network, server, storage & applications infrastructure will have to
evolve to meet the requirements for information sharing & the use of much
larger amounts of data.
Most HIT systems in use today, including Practice Management
& EHR systems, as well as other directories (immunization, etc.) & data
warehouse efforts are based on underlying relational database technology.
Relational database (RDBMS) technology was, in turn, based on the relational
model of data proposed by E.F. “Ted” Codd in 1970[6].
Efforts to develop software systems based on this model started at UC Berkeley
& IBM in 1973. By 1984, commercial products were available from Ingres
Corporation (INGRES), IBM (System R) & the Digital Equipment Corporation
(Rdb)[7].
At this point, the design for RDBMSs is 45 years old – perhaps >10
generations of software evolution have occurred since then. I know from
personal experience that these systems were not designed for the management of
close to unlimited amounts of data, & even though they have been updated
continuously during the last 45 years, they do not provide an adequate
foundation for current & near-future use. RDBMS systems are not
appropriate, or even highly functional, at levels of data >1PB. There are
already some healthcare organizations that have more data than this.
In addition, data in RDBMSs needs to be highly structured
(in relational normal form) & alphanumeric in type. Large advances have been
made in the management & utilization of unstructured data in the last 25
years, but RDBMSs still manage this material as binary large objects (blobs)
that are not searchable or utilizable except as a whole. As already stated,
healthcare data, even for individual organizations, is fast approaching a size
that makes the use of relational systems infeasible. The real issue, however,
is the necessary heterogeneity of this data. PM & EHR systems generate both
structured & unstructured data, such as physician’s notes. There is also
financial data, generated & managed by separate systems, that, even
though structured, is in very different formats & is not relationally tied to
the clinical, demographic & financial (claims) data managed in PM & EHR
systems. Then there is additional data generated by the separate operational
systems of HIE & ACO partners. Finally, there is a whole “ocean” of both
structured & unstructured data in such systems as: State & Federal
public & population health sources, various demographic & financial
trend data from public & private sources, & a whole variety of other
sources & types of data required for participation in new types of organizations
& for the analytics increasingly required for clinical & operational
optimization. Aggregation of this data, even virtually, through current
conventional means (data warehouse based on RDBMS technology) is a daunting
prospect as all of the data would have to be empirically & semantically
normalized… in many cases an infeasible task [8].
Recently, healthcare organizations have attempted to get
around some of these limitations by creating data extracts & data
warehouses with their in-house data. A data warehouse is “an extract of an organization’s
data - often drawn from multiple sources - to facilitate analysis, reporting
and strategic decision making. It contains only alphanumeric data, not
documents or other types of content. The data is stored separately from the
organization’s primary applications and databases such as practice management
systems and electronic health records. The data is transformed to match a
uniform data model, cleansed of duplicates and inaccuracies, and is extracted
with business intelligence and reporting tools. A data warehouse contains the
entire scope of data and can be used for both very general and very specific analysis
and reporting.”[9]
In other words, it
is a single, accessible data storage facility separate from any of the data
sources (PM, EHR etc.) that is normalized & highly structured according to
a single data model that spans all of the data sources. It generally is
designed to contain as much of the data of an organization (or multiple
organizations) as possible. Data that cannot fit into this structure is not
part of the warehouse. A data extract is similar except that it is a subset of
the data warehouse designed for a specific purpose, such as to provide the data
for a particular report or set of reports. The development of a data warehouse
is a very large effort, often multiple years, that includes a long design
process, a long data extraction, normalization, translation & load process,
& a long testing process. Data extracts may be updated on a short time
period basis, even nightly, but data warehouses are updated more rarely, so
that the data in them, while correct, may be stale except for retrospective
analysis. Data extracts & warehouses tend to be too labor intensive in
design & implementation & too rigid with respect to the need for
normalization & translation to be of much use in an environment where data
is constantly generated & extremely heterogeneous, as already mentioned.
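To illustrate the mechanics (not the scale) of the approach, here is a minimal sketch of a nightly data extract in Python, with hypothetical database & table names; a real warehouse effort involves far more design, normalization & validation than this:

```python
import sqlite3
import pandas as pd

SOURCE_DB = "practice_management.db"   # hypothetical operational (PM) database
EXTRACT_DB = "reporting_extract.db"    # hypothetical separate reporting extract

# Extract: read billed encounters from the operational system.
with sqlite3.connect(SOURCE_DB) as src:
    encounters = pd.read_sql_query(
        "SELECT patient_id, encounter_date, billed_amount FROM encounters", src
    )

# Transform: normalize to the extract's uniform model & drop duplicates.
encounters["encounter_date"] = pd.to_datetime(encounters["encounter_date"])
encounters["billed_amount"] = encounters["billed_amount"].astype(float)
encounters = encounters.drop_duplicates()

# Load: replace the extract table so reports always see last night's snapshot.
with sqlite3.connect(EXTRACT_DB) as tgt:
    encounters.to_sql("billed_encounters", tgt, if_exists="replace", index=False)
```

Note that every new data source added to such an extract requires the same design, normalization & load work, which is why the approach scales so poorly as data becomes more heterogeneous.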
Most healthcare organizations use either a report generator
or a BI tool to provide query capability over the primarily PM & EHR data
that they manage. All but the very largest or best funded tend to run canned
reports that are programmed to be executed at specific intervals. These reports
include such things as: amount billed per time period, number of patients per
time period etc. Also, most organizations run required quality measures that
must be reported to various government & accrediting agencies. Report
generators & BI tools are adequate to perform these analytic tasks, but
they are inadequate in several areas. The first is in the amount of data that
they can deal with. Most BI tools & report writers are limited to accessing
data from relational databases, so they have whatever limitations the
underlying database has. Many, though, have additional limitations on the
amount of data that they can use in a report. Tools such as Cognos (IBM),
Business Objects (SAP) & Crystal Reports (SAP) all have limitations on
either the number of tables they can draw data from or the number of parameters
they can utilize. These are substantial limitations in the new universe of
data. Also, their reliance on an underlying relational database is, in itself, a
substantial limitation. Further, the tools are mostly limited to the use of SQL
as a query language. The ANSI X3H2 committee standardized SQL as a database
query language in 1986[10]. It was intended as a specialized, non-Turing
complete language for the manipulation of relational data according to the
relational algebra & tuple calculus described by Ted Codd & initially
developed by IBM in the 1970s. It was never intended as a general inquiry or
modeling tool, & in fact, is not appropriate as one. Finally, these tools
rely on data normalization & translation capabilities, often not supplied
by the tool, in order to be able to manipulate data & produce reports.
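For context, here is the kind of canned report these tools typically produce, sketched in Python against the hypothetical extract table from the earlier example; SQL handles this comfortably, & the limitations only appear with much larger, more heterogeneous data & with descriptive or predictive modeling:

```python
import sqlite3

# Monthly billing & patient-volume report against the hypothetical extract.
with sqlite3.connect("reporting_extract.db") as conn:
    rows = conn.execute(
        """
        SELECT strftime('%Y-%m', encounter_date) AS month,
               SUM(billed_amount)                AS total_billed,
               COUNT(DISTINCT patient_id)        AS patients_seen
        FROM billed_encounters
        GROUP BY month
        ORDER BY month
        """
    ).fetchall()

for month, total_billed, patients_seen in rows:
    print(f"{month}: ${total_billed:,.2f} billed across {patients_seen} patients")
```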
It appears that current HIT systems are not
designed in such a way as to provide adequate support for the documentation
& treatment of today’s complex patients or for the support of the new types
of workflows that are required by new regulation & organizational models. A
recent informal survey by the author[11]
was intended to determine whether several EHR systems currently in use could
produce a patient record with multiple comorbidities reported without untoward
effort. Six different EHR vendors were presented with a use case that should
have generated a patient record with multiple diagnoses representing the
patient’s primary comorbidities. None of them did. In fact, the experts at each
vendor had many different reasons why this was the case, but the fact remains
that the diagnosis & treatment of these clusters (e.g. Hypertension/Diabetes/Obesity/Congestive Heart Failure) are
essential in both providing more effective care with better outcomes & in
reducing costs. There is no facility in any of these systems for supporting
workflows that span organizations (HIEs, eReferral…), & none of the vendors
surveyed had any plans to support this capability.
In summary:
- Relational database systems were not designed to manage multiple TBs of data let alone multiple PBs of data, nor was SQL designed to be able to do the kind of query & modeling that will increasingly be required.
- Current data warehouse technology is not flexible enough to provide storage for the extremely heterogeneous range of data & data types that organizations will need to manage & analyze in the near future.
- The report generator & BI tools now in use are not adequate to deal with the volumes of data or to create & manipulate the descriptive & predictive models that will increasingly be how data is analyzed in the near future, &
- Current HIT systems, such as PM & EHR systems, do not – & in many instances cannot – provide the ability to document & treat today’s complex patients, nor can they support the types of multi-organizational shared workflows that will be required in the near future.
New data management, storage & analytic capabilities are
needed in order to support the new workflow, process & decision making
capabilities that will be required in the near future (next 3-5 years). Waiting
3 years until current capabilities have proved inadequate is not an option.
It is important to emphasize that this evolution is not just
about information technology, or technology at all for that matter. It really
is about a change in thinking for healthcare organizations, & that change
has to do with how to think about data & data usage. It is about developing
a sense of data awareness in all personnel, not just in the IT staff or in
people who make decisions. All people in the organization make decisions, some
strategic & some tactical, but all of these decisions must now be made with
a new awareness of data. This awareness includes understanding:
- What internal & external data is required to facilitate the decision?
- Is the data available? Where is it located? Is it accessible? Can it be acquired?
- What is the quality of the data? Does its quality limit how it can be used? How can data quality be improved?
- What is the most effective type of analysis to support the required decision(s)? Can we perform such analyses? Can we interpret them? Can we utilize the results in the context of our strategy or tactic?
The development of this data awareness & the alignment
of analysis with strategic decision making can be called the use of “data as an
asset” (D3A). This development requires training, discussion & consensus
building in order for the organization to adopt D3A as a core capability.
Clearly, healthcare organizations cannot abandon their
current process & technology infrastructure. Just as clearly, they will
need to continue to use their current PM & EHR systems in order to meet
operational needs & regulatory requirements. How, then, can these
organizations begin to move toward a process & technology infrastructure that
supports new needs & requirements & is relatively simple &
inexpensive. Here are a series of steps that begin this process.
1. Begin discussion on the relationship of strategy, decision making & analytics, with emphasis on data awareness across the organization & data as an asset, as well as use of all relevant data, internal & external
2. Evaluation of current & near-future partners to determine the need for cross-organizational workflows
3. Assessment of current information infrastructure & software application inventory to determine gaps in meeting near-future connectedness, storage & data management & analytic capabilities
4. Decisions on the evolution of the information infrastructure to support new data & analytic requirements
5. Deployment & testing of new infrastructure elements
6. Personnel training
7. Process modification as needed
8. Piloting of new analytic capability & integration into decision making
Apart from the non-technical aspects of this evolution,
there are a number of possible directions that the evolution of the information
infrastructure could take. These include at least:
- Continuing with the current infrastructure – This is not feasible as already discussed. The changes coming in the next several years in terms of required interoperability, data sharing, data volume & data heterogeneity will make the use of current storage, information management & analytic infrastructure increasingly difficult.
- Creation of data extracts &/or data warehouse using conventional (relational) technology – also not feasible as discussed
- Creation of data extracts &/or data warehouse using a new model databases (NoSQL, Column-based etc.) – provides the scalability & some of the ability to deal with heterogeneous data that is required & could serve as an interim tactic, may still require considerable data normalization & translation
- Use of an ultra-large scale file system or ultra-large scale data store coupled with a contemporary analytic system – Systems such as Google Big Table or any of the various open source & proprietary Hadoop implementations provide almost unlimited scalability & the opportunity to manage & analyze very heterogeneous data, coupled with an analytic (eco)system such as Yarn (Map Reduce2) &/or the use of an analytic programming language such as Pig, R etc. allows for the development of a range of descriptive & predictive models that use the ultra-large information store as the basis for analysis
This last alternative allows for the management &
analysis of an almost unlimited amount of very dissimilar data. A conventional
data extract or data warehouse can be used to populate the information store
(Hadoop Distributed File System, Bigtable, etc.). Second-generation systems
are already becoming available, such as Apache Spark, which appears to be more
performant than Hadoop currently is, although custom programming (for instance
in R) against Google Bigtable is also very performant.
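As a concrete (if simplified) illustration, here is a hedged sketch in Python using PySpark, with hypothetical HDFS paths & field names, of aggregating structured claims & semi-structured clinical documents without first forcing them into a single normalized relational schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cost-per-patient").getOrCreate()

# Structured claims data landed in HDFS from the PM/financial systems (CSV).
claims = spark.read.csv("hdfs:///health/claims/*.csv", header=True, inferSchema=True)

# Semi-structured clinical document metadata (e.g. physician notes) as JSON.
notes = spark.read.json("hdfs:///health/clinical_notes/*.json")

# Total paid cost & note volume per patient, computed across both sources.
cost = claims.groupBy("patient_id").agg(F.sum("paid_amount").alias("total_cost"))
note_counts = notes.groupBy("patient_id").agg(F.count("*").alias("note_count"))

summary = cost.join(note_counts, on="patient_id", how="outer")
summary.write.mode("overwrite").parquet("hdfs:///health/summary/cost_per_patient")

spark.stop()
```

The point is less the particular engine than the pattern: the raw, heterogeneous data stays in the ultra-large store, & the analytic layer reads & combines it on demand.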
The continued development & adoption of these systems in
healthcare seems to be the best alternative over the next 3-5 years. It
provides the ability to manage & utilize almost unlimited amounts of data –
at least multiple petabytes – although Google is thought to have between 1
& 2 exabytes (2^60 bytes, roughly 10^18 bytes, 1,024 PBs, a
truly immense amount) of data under management. The use of various modules that
allow SQL query of these systems (e.g.
Cloudera Impala) provides an easy entrée, although development of analytic
models in R or Python provides very deep analytic capability in ways that SQL
query does not. Finally, the fact that many of these systems, especially the
Hadoop-based ones, are open source means that they will continue to evolve in
productive ways. Adoption of such a system is relatively easy, especially if
SQL query is used as the initial analytic capability. I am currently doing a
large project deploying Hadoop-based analytic stacks into Federally Qualified
Health Centers (FQHCs) & one of my next posts will describe the lessons
learned in deployment & adoption in this project.
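For readers who want a sense of what the "SQL as an easy entrée" path looks like in practice, here is a hedged sketch using the impyla Python client against Cloudera Impala; the host, port, database & table names are hypothetical, & the same monthly report from the earlier example is simply re-pointed at data stored in Hadoop:

```python
from impala.dbapi import connect  # impyla package; assumes a reachable Impala daemon

conn = connect(host="impala.example.org", port=21050)
cursor = conn.cursor()
cursor.execute(
    """
    SELECT substr(encounter_date, 1, 7) AS month,
           SUM(billed_amount)           AS total_billed,
           COUNT(DISTINCT patient_id)   AS patients_seen
    FROM warehouse.billed_encounters
    GROUP BY substr(encounter_date, 1, 7)
    ORDER BY month
    """
)
for month, total_billed, patients_seen in cursor.fetchall():
    print(month, total_billed, patients_seen)

cursor.close()
conn.close()
```

Queries like this make the transition gentle; the deeper payoff comes when the same data is opened up to R or Python models that SQL alone cannot express.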
Up next:
- Lessons Learned in Deploying Hadoop-based Analytics in the Healthcare Safety Net
- Intelligent search, big data, deep learning in healthcare information technology (& healthcare in general)
- Design as a model for the evolution of work… What will future knowledge workers be like?
- & further in the future… a “meditation” on the evolution of information technology using the “cluster of terms” model[12]
[1] Data from Bloomberg School of Public Health,
Partnership for Solutions Program, Johns Hopkins University
[2] Yang, L. & G.A. Colditz. 2015. Prevalence of
Overweight & Obesity in the U.S., 2007-2012.
JAMA Int. Med. Published online 22 June 2015.
[3] C.S. Wallace, Statistical and Inductive
Inference by Minimum Message Length, Springer-Verlag
(Information Science and Statistics), ISBN
0-387-23795-X, May 2005
[4] C.S. Wallace, Statistical and Inductive
Inference by Minimum Message Length, Springer-Verlag
(Information Science and Statistics), ISBN
0-387-23795-X, May 2005
[5] Sintchenko, V. & E. Coiera. 2006. Decision
Complexity Affects the Extent and Type of Decision Support Use. AMIA Ann. Symp.
724-728
[6] Codd, E.F. (1970). "A Relational Model of Data for Large
Shared Data Banks". Communications of the ACM 13 (6): 377–387. doi:10.1145/362384.362685.
[7] The current author (DJH) was the Architect for Rdb at
V1 & V2 for the Digital Equipment Corporation.
[8] Deshpande, R. & B. Desai. 2014. Limitations of
Datawarehouse Platforms and Assessment of Hadoop as an Alternative. Int. J.
Inform. Tech. & MIS. 5(2) May-Aug 2014. Pp 51-58.
[9] Health Centers & the Data Warehouse. 2008. Grob,
M. & D.J. Hartzband. Funded by the National Association of Community Health
Centers under HRSA/BPHC Cooperative Agreement U30CS08661.
[10] The author (DJH) was the representative to the ANSI
X3H2 Committee from the Digital Equipment Corporation.
[11] Reported in: Path2Analytics Project Review.
Association of Clinicians for the Underserved Annual Meeting. June 2, 2015.
Alexandria, VA.
[12] Foster, Hal. 2015. Bad New Days: Art, Criticism, Emergency.
Verso. NYC. 208 pp.