Thursday, September 10, 2015

Data as an Asset: The Evolution of Healthcare Information Technology


 We’re entering a new world where data may be more important than software.
                                                             Tim O’Reilly


 




The last five years have seen a huge amount of change in the way healthcare is delivered in this country & in the information systems that broadly support this delivery. Much, but not all, of this change has been brought about by the American Recovery & Reinvestment Act (2/2009, Pub.L. 111-5, ARRA) & the Patient Protection & Affordable Care Act (3/2010, Pub.L. 111-148, ACA), as well as changes to the Centers for Medicare & Medicaid Services (CMS) Physician Fee Schedule (PFS 2012-2016).

These laws & regulations, along with others, have created an environment in the healthcare space in which providers & healthcare organizations have had to respond to:
  • Requirements for the Meaningful Use of healthcare information technology (HIT) in order to receive enhanced payments from CMS
  • Participation in new types of organizations including Healthcare Information Exchanges (HIEs) & Accountable Care Organizations (ACOs)
  • Transition to new forms of reimbursement as the system moves from a fee-for-service basis to value- & performance-based payment

These represent huge structural & functional changes in the healthcare system, & are consequently creating new issues & requirements as the system evolves. These include:
  • Need for increased connectedness among providers & healthcare organizations
    • New organizational models such as HIEs & ACOs require higher levels of data sharing to improve care. This in turn requires the ability to share process & workflow across organizational & temporal boundaries
    • Currently about 280 HIEs in the U.S. These require substantial network connectivity & a shared data architecture in order to be able to do anything more than “fax PDFs” among participants.
    • Currently about 450 ACOs in the U.S. These need to be able to share demographic (identity) & cost/patient data in order to document shared risk criteria
    • eReferral networks provide a capability that today is still largely handled by fax machines. They mostly use DIRECT technology for data sharing, but have also begun to offer shared clinical decision support
    • All of these organizational models provide the opportunity for shared process including: clinical decision support, analytics, common & shared workflow for diagnosis & treatment, & shared processes for operational coordination
  • Need to access, manage & utilize much larger amounts of data as well as many more different types of data
    • Most Community Health Centers (CHCs) have 150-250 GBs of EHR & PM data (usually representing 3-5 data years)
    • Most Primary Care Associations (PCAs) have 5-10 TBs of data
    • Most small clinics & hospitals are similar
      • These values will more than double in the next 3-5 years
      • Data sources will include clinical & financial data from EHR, PM, HIE & ACO systems, eReferral, State & Federal public & population health sources, public demographic & macroeconomic trend data, etc.
    • Some organizations already have ultra-large amounts of data
      • Kaiser is said to have 9.5M patient records for 15 years (~40-50 PBs)
    • Many healthcare organizations, at all levels, are starting to create data extracts or data warehouses using existing (relational database) technology
  • Increased patient complexity & associated treatment[1]
    • 29% of the U.S. adult population has hypertension, with an average of 3 comorbidities including arthritis, high cholesterol, heart disease & diabetes
    • 9%-10% of the U.S. adult population has diabetes, with an average of 3.3 comorbidities including obesity, hypertension, high cholesterol, heart disease & arthritis
    • A new study finds that 75% of U.S. men & 67% of U.S. women are overweight or obese[2]
    • Patients with 3 comorbidities cost $6,178/year more to treat[1]
  • Increased strategy & decision complexity
    • Decisions, especially strategic decisions, need to be evaluated using much more data as well as many different types of data
    • Such decisions can be shown to be more complex with respect to how much cognitive effort they require[3] & the informational complexity of both their representation & their alternatives[4]
    • There is some evidence to show that as decision complexity increases, people tend to use less probability-based decision support[5]

These structural & functional changes will require healthcare organizations to evolve, & this means that their information systems will also have to evolve. Current operational & clinical processes (workflows) are not adequate to support this evolution. Healthcare organizations of all types will need to move toward new care & treatment models that emphasize both multi-organizational shared processes & utilization of much larger amounts of data. The emphasis on data & process sharing is especially important to provide appropriate continuity of care & transitions of care (personnel, locational, etc.). New structural models such as HIEs & eReferral networks will require workflows that span organizational boundaries. A provider working with data from an HIE or eReferral partner will need a process that takes into account external data & may need a process that allows for more direct intermixing of workflows between partners. Working in such multi-organizational entities will require new workflow tools that provide for much more collaboration. These workflows have the possibility of adding considerable complexity to patient treatment & will have to be carefully designed to avoid this. In addition, working in ACOs will require the ability to generate, share & use empirical cost/patient data. Many healthcare organizations estimate this from claims data or actually use claims data as a surrogate for cost; that is not adequate in the ACO model. The poor alignment between cost & clinical data makes generating true cost/patient figures difficult: many systems have no data connection (common fields or keys) between their financial system data & their EHR data. This will have to change in order for the ACO model to work with a broad set of healthcare organizations.
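As a concrete illustration of that last point, here is a minimal sketch, in Python, of joining an EHR extract to financial-system data on a shared patient identifier to produce empirical cost/patient figures. The field names (patient_id, encounter_cost, diagnosis_codes) & the data are hypothetical; as noted above, many real systems lack even such a common key.

# Minimal sketch: joining EHR data to financial-system cost data on a shared
# patient identifier. All field names & values are hypothetical.
import pandas as pd

ehr = pd.DataFrame({
    "patient_id": ["P001", "P002", "P003"],
    "diagnosis_codes": ["E11.9;I10", "I10", "E66.9;E11.9;I10"],
})

finance = pd.DataFrame({
    "patient_id": ["P001", "P001", "P002", "P003"],
    "encounter_cost": [182.50, 644.00, 97.25, 415.75],
})

# Sum encounter-level costs to a per-patient figure, then attach clinical context.
cost_per_patient = finance.groupby("patient_id", as_index=False)["encounter_cost"].sum()
merged = ehr.merge(cost_per_patient, on="patient_id", how="left")
print(merged)

The point of the sketch is the join key itself: without a shared identifier between the financial & clinical systems, the merge step has nothing to operate on.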

More than the workflow & data environments will have to change; the underlying infrastructure that supports them will also have to change. Current network, server, storage & application infrastructure will have to evolve to meet the requirements for information sharing & the use of much larger amounts of data.

Most HIT systems in use today, including Practice Management & EHR systems, as well as other registries (immunization, etc.) & data warehouse efforts, are based on underlying relational database technology. Relational database (RDBMS) technology was, in turn, based on the relational model of data proposed by E.F. “Ted” Codd in 1970[6]. Efforts to develop software systems based on this model started at UC Berkeley & IBM in 1973. By 1984, commercial products were available from Ingres Corporation (INGRES), IBM (SQL/DS & DB2, the commercial descendants of its System R prototype) & the Digital Equipment Corporation (Rdb)[7]. At this point, the design for RDBMSs is 45 years old – perhaps >10 generations of software evolution have occurred since then. I know from personal experience that these systems were not designed for the management of close to unlimited amounts of data, & even though they have been updated continuously during the last 45 years, they do not provide an adequate foundation for current & near-future use. RDBMSs are not appropriate, or even highly functional, at levels of data >1 PB. There are already some healthcare organizations that have more data than this.

In addition, data in RDBMSs needs to be highly structured (relational normal form) & alphanumeric in type. Large advances have been made in the management & utilization of unstructured data in the last 25 years, but RDBMSs still manage this material as binary large objects (blobs) that are not searchable or utilizable except as a whole. As already stated, healthcare data, even for individual organizations, is fast approaching a size that makes the use of relational systems infeasible. The real issue, however, is the necessary heterogeneity of this data. PM & EHR systems generate both structured & unstructured data, such as physicians’ notes. There is also financial data that is generated & managed by separate systems & that, even though structured, is in very different formats & is not relationally tied to the clinical, demographic & financial (claims) data managed in PM & EHR systems. Then there is additional data generated by separate operational systems from HIE & ACO partners. Finally, there is a whole “ocean” of both structured & unstructured data in such systems as: State & Federal public & population health sources, various demographic & financial trend data from public & private sources, & a whole variety of other sources & types of data required for participation in new types of organizations & for the analytics increasingly required for clinical & operational optimization. Aggregation of this data, even virtually, through current conventional means (a data warehouse based on RDBMS technology) is a daunting prospect, as all of the data would have to be empirically & semantically normalized… in many cases an infeasible task[8].
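To make the phrase “semantically normalized” concrete, here is a minimal Python sketch of mapping the same clinical concept, expressed differently by three hypothetical sources, onto one shared term. The codes & mapping table are illustrative only; real mappings span thousands of codes & free-text variants, which is why the paragraph above describes full normalization as a daunting & often infeasible task.

# Illustrative mapping of source-specific values onto one shared concept.
LOCAL_TO_COMMON = {
    ("ehr", "250.00"): "diabetes_type_2",              # ICD-9-style code
    ("ehr", "E11.9"): "diabetes_type_2",               # ICD-10-style code
    ("claims", "DM2"): "diabetes_type_2",              # payer-specific shorthand
    ("notes", "type ii diabetes"): "diabetes_type_2",  # free text from a note
}

def normalize(source, value):
    # Free text is case-normalized; coded values are matched exactly.
    key = value.strip().lower() if source == "notes" else value
    return LOCAL_TO_COMMON.get((source, key))

print(normalize("claims", "DM2"))              # diabetes_type_2
print(normalize("notes", "Type II Diabetes"))  # diabetes_type_2
print(normalize("ehr", "401.9"))               # None -- unmapped, needs human review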

Recently, healthcare organizations have attempted to get around some of these limitations by creating data extracts & data warehouses with their in-house data. A data warehouse is “an extract of an organization’s data - often drawn from multiple sources - to facilitate analysis, reporting and strategic decision making. It contains only alphanumeric data, not documents or other types of content. The data is stored separately from the organization’s primary applications and databases such as practice management systems and electronic health records. The data is transformed to match a uniform data model, cleansed of duplicates and inaccuracies, and is extracted with business intelligence and reporting tools. A data warehouse contains the entire scope of data and can be used for both very general and very specific analysis and reporting.”[9] In other words, it is a single, accessible data storage facility, separate from any of the data sources (PM, EHR, etc.), that is normalized & highly structured according to a single data model that spans all of the data sources. It is generally designed to contain as much of the data of an organization (or multiple organizations) as possible. Data that cannot fit into this structure is not part of the warehouse. A data extract is similar except that it is a subset of the data warehouse designed for a specific purpose, such as providing the data for a particular report or set of reports. The development of a data warehouse is a very large effort, often spanning multiple years, that includes a long design process; a long data extraction, normalization, translation & load process; & a long testing process. Data extracts may be updated frequently, even nightly, but data warehouses are updated more rarely, so that the data in them, while correct, may be stale for anything other than retrospective analysis. Data extracts & warehouses tend to be too labor intensive in design & implementation & too rigid with respect to the need for normalization & translation to be of much use in an environment where data is constantly generated & extremely heterogeneous, as already mentioned.
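The warehouse process described above can be summarized in a minimal sketch: extract rows from two hypothetical source layouts, transform them to a single uniform model, remove duplicates, & load them into a separate store. SQLite stands in for the warehouse here, & the field names are invented; a real effort differs mainly in scale & heterogeneity, which is where the multi-year timelines come from.

# Minimal ETL sketch: extract from two hypothetical source layouts, transform to
# one uniform record layout, drop duplicates, & load into a separate store.
import sqlite3

pm_rows = [  # practice-management extract (hypothetical layout)
    {"pat_id": "P001", "dob": "1956-03-02", "last_visit": "2015-06-11"},
]
ehr_rows = [  # EHR extract (different field names for the same facts)
    {"patient": "P001", "birth_date": "1956-03-02", "seen_on": "2015-06-11"},
]

def to_uniform(row):
    # Transform either source layout into the warehouse's single data model.
    return (
        row.get("pat_id") or row.get("patient"),
        row.get("dob") or row.get("birth_date"),
        row.get("last_visit") or row.get("seen_on"),
    )

uniform = {to_uniform(r) for r in pm_rows + ehr_rows}  # set() removes exact duplicates

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE patient_visits (patient_id TEXT, birth_date TEXT, visit_date TEXT)")
warehouse.executemany("INSERT INTO patient_visits VALUES (?, ?, ?)", sorted(uniform))
print(warehouse.execute("SELECT COUNT(*) FROM patient_visits").fetchone()[0])  # -> 1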

Most healthcare organizations use either a report generator or a BI tool to provide query capability over the PM & EHR data that they primarily manage. All but the very largest or best funded tend to run canned reports that are programmed to be executed at specific intervals. These reports include such things as: amount billed per time period, number of patients per time period, etc. Also, most organizations run required quality measures that must be reported to various government & accrediting agencies. Report generators & BI tools are adequate to perform these analytic tasks, but they are inadequate in several areas. The first is in the amount of data that they can deal with. Most BI tools & report writers are limited to accessing data from relational databases, so they have whatever limitations the underlying database has. Many, though, have additional limitations on the amount of data that they can use in a report. Tools such as Cognos (IBM), Business Objects (SAP) & Crystal Reports (SAP) all have limitations on either the number of tables they can draw data from or the number of parameters they can utilize. These are substantial limitations in the new universe of data. Their reliance on an underlying relational database is, in itself, a substantial limitation as well. Further, the tools are mostly limited to the use of SQL as a query language. The ANSI X3H2 committee standardized SQL as a database query language in 1986[10]. It was intended as a specialized, non-Turing-complete language for the manipulation of relational data according to the relational algebra & tuple calculus described by Ted Codd & initially developed by IBM in the 1970s. It was never intended as a general inquiry or modeling tool, & in fact is not appropriate as one. Finally, these tools rely on data normalization & translation capabilities, often not supplied by the tool, in order to be able to manipulate data & produce reports.
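For reference, here is a minimal sketch of the kind of canned report mentioned above (amount billed per time period), expressed as plain SQL run from Python against a toy SQLite table. The table & column names are hypothetical; the point is that this style of retrospective aggregate is exactly what SQL & the current tools do well, in contrast to the modeling tasks discussed further below.

# Minimal "canned report" sketch: amount billed per month from a toy claims table.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE claims (claim_id TEXT, service_date TEXT, amount REAL)")
db.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [("C1", "2015-07-03", 120.0), ("C2", "2015-07-19", 80.0), ("C3", "2015-08-02", 45.5)],
)

report = db.execute(
    """
    SELECT substr(service_date, 1, 7) AS billing_month, SUM(amount) AS billed
    FROM claims
    GROUP BY billing_month
    ORDER BY billing_month
    """
).fetchall()
print(report)  # -> [('2015-07', 200.0), ('2015-08', 45.5)]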

It appears that current HIT systems are not designed in such a way as to provide adequate support for the documentation & treatment of today’s complex patients or for the support of the new types of workflows that are required by new regulation & organizational models. A recent informal survey by the author[11] was intended to determine whether several EHR systems currently in use could produce a patient record with multiple comorbidities reported without untoward effort. Six different EHR vendors were presented with a use case that should have generated a patient record with multiple diagnoses representing the patient’s primary comorbidities. None of them did. In fact, the experts at each vendor had many different reasons why this was the case, but the fact remains that the diagnosis & treatment of these clusters (e.g. Hypertension/Diabetes/Obesity/Congestive Heart Failure) are essential both in providing more effective care with better outcomes & in reducing costs. There is no facility in any of these systems for supporting workflows that span organizations (HIEs, eReferral…), & none of the vendors surveyed had any plans to support this capability.


In summary:
  • Relational database systems were not designed to manage multiple TBs of data let alone multiple PBs of data, nor was SQL designed to be able to do the kind of query & modeling that will increasingly be required.
  • Current data warehouse technology is not flexible enough to provide storage for the extremely heterogeneous range of data & data types that organizations will need to manage & analyze in the near future.
  • The report generator & BI tools now in use are not adequate to deal with the volumes of data or to create & manipulate the descriptive & predictive models that will increasingly be how data is analyzed in the near future, &
  • Current HIT systems, such as PM & EHR systems, do not, & in many instances cannot, provide the ability to document & treat today’s complex patients, nor can they support the types of multi-organizational shared workflows that will be required in the near future.
New data management, storage & analytic capabilities are needed in order to support the new workflow, process & decision making capabilities that will be required in the near future (next 3-5 years). Waiting 3 years until current capabilities have proved inadequate is not an option.

It is important to emphasize that this evolution is not just about information technology, or technology at all for that matter. It really is about a change in thinking for healthcare organizations, & that change has to do with how to think about data & data usage. It is about developing a sense of data awareness in all personnel, not just in the IT staff or in those formally charged with making decisions. All people in the organization make decisions, some strategic & some tactical, & all of these decisions must now be made with a new awareness of data. This awareness includes understanding:
  • What internal & external data is required to facilitate the decision?
  • Is the data available? Where is it located? Is it accessible? Can it be acquired?
  • What is the quality of the data? Does its quality limit how it can be used? How can data quality be improved?
  • What is the most effective type of analysis to support the required decision(s)? Can we perform such analyses? Can we interpret them? Can we utilize the results in the context of our strategy or tactic?

The development of this data awareness & the alignment of analysis with strategic decision making can be called the use of “data as an asset” (D3A). This development requires training, discussion & consensus building in order for the organization to adopt D3A as a core capability.

Clearly, healthcare organizations cannot abandon their current process & technology infrastructure. Just as clearly, they will need to continue to use their current PM & EHR systems in order to meet operational needs & regulatory requirements. How, then, can these organizations begin to move toward a process & technology infrastructure that supports new needs & requirements & is relatively simple & inexpensive? Here is a series of steps that begins this process.

1. Begin discussion of the relationship among strategy, decision making & analytics, with emphasis on data awareness across the organization, on data as an asset & on the use of all relevant data, internal & external
2. Evaluate current & near-future partners to determine the need for cross-organizational workflows
3. Assess the current information infrastructure & software application inventory to determine gaps in meeting near-future connectedness, storage, data management & analytic requirements
4. Make decisions on the evolution of the information infrastructure to support the new data & analytic requirements
5. Deploy & test new infrastructure elements
6. Train personnel
7. Modify processes as needed
8. Pilot the new analytic capability & its integration into decision making

Apart from the non-technical aspects of this evolution, there are a number of possible directions that the evolution of the information infrastructure could take. These include at least:
  • Continuing with the current infrastructure – This is not feasible as already discussed. The changes coming in the next several years in terms of required interoperability, data sharing, data volume & data heterogeneity will make the use of current storage, information management & analytic infrastructure increasingly difficult.
  • Creation of data extracts &/or a data warehouse using conventional (relational) technology – also not feasible, as discussed
  • Creation of data extracts &/or a data warehouse using new-model databases (NoSQL, column-based, etc.) – provides the scalability & some of the ability to deal with heterogeneous data that is required & could serve as an interim tactic, but may still require considerable data normalization & translation
  • Use of an ultra-large scale file system or ultra-large scale data store coupled with a contemporary analytic system – Systems such as Google Bigtable or any of the various open source & proprietary Hadoop implementations provide almost unlimited scalability & the opportunity to manage & analyze very heterogeneous data. Coupled with an analytic (eco)system such as YARN (MapReduce 2) &/or an analytic programming language such as Pig, R, etc., they allow for the development of a range of descriptive & predictive models that use the ultra-large information store as the basis for analysis (a minimal sketch of this style of processing follows this list)
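The sketch referenced in the last bullet: a Hadoop Streaming job written in Python, with a mapper/reducer pair that counts encounters per diagnosis code across a large HDFS-resident extract. The input layout (tab-separated patient_id & diagnosis_code), the file paths & the script name are hypothetical, & error handling is omitted.

# Minimal Hadoop Streaming sketch: count encounters per diagnosis code.
# Input lines are assumed to be tab-separated: patient_id <TAB> diagnosis_code
import sys

def mapper():
    # Emit one (diagnosis_code, 1) pair per encounter line read from stdin.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            print(fields[1] + "\t1")

def reducer():
    # Hadoop delivers mapper output sorted by key, so counts can be accumulated
    # in a single pass & flushed whenever the key changes.
    current, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if current is not None and key != current:
            print(current + "\t" + str(total))
            total = 0
        current = key
        total += int(value)
    if current is not None:
        print(current + "\t" + str(total))

if __name__ == "__main__":
    mapper() if len(sys.argv) > 1 and sys.argv[1] == "map" else reducer()

It would be launched roughly as: hadoop jar hadoop-streaming.jar -input /data/encounters -output /data/dx_counts -mapper "python dx_count.py map" -reducer "python dx_count.py reduce" (the exact jar location & options depend on the distribution). The same pattern scales from megabytes to petabytes because the framework handles partitioning, sorting & distribution.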

This last alternative allows for the management & analysis of an almost unlimited amount of very dissimilar data. A conventional data extract or data warehouse can be used to populate the information store (Hadoop Distributed File System, Bigtable, etc.). Second-generation systems are already becoming available, such as Apache Spark (in which IBM has invested heavily), which appears to be more performant than Hadoop MapReduce currently, although custom programming (for instance in R) against Google Bigtable is very performant.

The continued development & adoption of these systems in healthcare seems to be the best alternative over the next 3-5 years. It provides the ability to manage & utilize almost unlimited amounts of data – at least multiple petabytes – although Google is thought to have between 1 & 2 exabytes (2^60 bytes, ~10^18 bytes, ~1,024 PBs, a truly immense amount) of data under management. The use of various modules that allow SQL query of these systems (e.g. Cloudera Impala) provides an easy entrée, although development of analytic models in R or Python provides very deep analytic capability in ways that SQL query does not. Finally, the fact that many of these systems, especially the Hadoop-based ones, are open source means that they will continue to evolve in productive ways. Adoption of such a system is relatively easy, especially if SQL query is used as the initial analytic capability. I am currently doing a large project deploying Hadoop-based analytic stacks into Federally Qualified Health Centers (FQHCs) & one of my next posts will describe the lessons learned in deployment & adoption in this project.
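As an illustration of the difference between SQL query & model development mentioned above, here is a minimal Python sketch of a simple descriptive/predictive model: fitting annual cost against comorbidity count & predicting from the fit, something a SQL-only layer cannot express directly. The data points are invented for illustration & are not real cost figures.

# Minimal predictive-model sketch: linear fit of annual cost vs. comorbidity count.
import numpy as np

comorbidities = np.array([0, 1, 2, 3, 4, 5])
annual_cost = np.array([3200.0, 5100.0, 7600.0, 11800.0, 15900.0, 21400.0])  # illustrative

# Fit a first-order (linear) model; the slope approximates added cost per comorbidity.
slope, intercept = np.polyfit(comorbidities, annual_cost, 1)
predicted_for_3 = slope * 3 + intercept
print("added cost per comorbidity: %.0f" % slope)
print("predicted annual cost at 3 comorbidities: %.0f" % predicted_for_3)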

Up next:
-       Lessons Learned in Deploying Hadoop-based Analytics in the Healthcare Safety Net
-       Intelligent search, big data, deep learning in healthcare information technology (& healthcare in general)
-       Design as a model for the evolution of work… What will future knowledge workers be like?
-       & further in the future… a “meditation” on the evolution of information technology using the “cluster of terms” model[12]


[1] Data from Bloomberg School of Public Health, Partnership for Solutions Program, Johns Hopkins University
[2] Yang, L. & G.A. Colditz. 2015. Prevalence of Overweight and Obesity in the United States, 2007-2012. JAMA Intern. Med. Published online 22 June 2015.
[3] Wallace, C.S. 2005. Statistical and Inductive Inference by Minimum Message Length. Springer-Verlag (Information Science and Statistics). ISBN 0-387-23795-X.
[4] Wallace, C.S. 2005. Statistical and Inductive Inference by Minimum Message Length. Springer-Verlag (Information Science and Statistics). ISBN 0-387-23795-X.
[5] Sintchenko, V. & E. Coiera. 2006. Decision Complexity Affects the Extent and Type of Decision Support Use. AMIA Ann. Symp. 724-728
[6] Codd, E.F. (1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377–387. doi:10.1145/362384.362685.
[7] The current author (DJH) was the Architect for Rdb at V1 & V2 for the Digital Equipment Corporation.
[8] Deshpande, R. & B. Desai. 2014. Limitations of Datawarehouse Platforms and Assessment of Hadoop as an Alternative. Int. J. Inform. Tech. & MIS. 5(2) May-Aug 2014. Pp 51-58.
[9] Health Centers & the Data Warehouse. 2008. Grob, M. & D.J. Hartzband. Funded by the National Association of Community Health Centers under HRSA/BPHC Cooperative Agreement U30CS08661.
[10] The author (DJH) was the representative to the ANSI X3H2 Committee from the Digital Equipment Corporation.
[11] reported in: Path2Analytics Project Review. Association of Clinicians for the Underserved Annual Meeting. June 2, 2015. Alexandria, VA
[12] Foster, Hal. 2015. Bad New Days: Art, Criticism, Emergency. Verso. NYC. 208 pp.
