The state of the art in health data analytics

Lola Koktysh

With 6 years of writing on business and technology, Lola is a Healthcare Industry Analyst at ScienceSoft, a software development and consulting company headquartered in McKinney, Texas. Being a HIMSS member, she focuses on Healthcare IT, highlighting the industry challenges and technology solutions that tackle them. Lola’s articles explore chronic disease management, mHealth, healthcare data analytics, value-based care delivery, CMS regulations and more.


There’s a multiverse of data in healthcare. As of 2012, digital healthcare data worldwide was estimated at 500 petabytes and is expected to reach 25,000 petabytes in 2020, the IBM TJ Watson Research Center reports.

While there will certainly be a lot of noise in these data piles, most of the information accumulated across the care continuum can be analyzed and put to work. It’s no rocket science: healthcare data analytics has been around for decades. Still, as of March 2017, Healthcare IT News voices the concerns of many healthcare executives about putting patient health data to use for consistent and continuous care improvement.

They say if you can’t measure, you can’t improve. So, when it comes to ‘where to start’, the answer is defining all critical goals and the data sources that hold the keys to achieving them.

Health data sources

Here, our data analytics consultants concentrate on base-level patient health data related to an individual’s condition, not aggregate information on all patients. For individual patients, the following health data sources can be distinguished.


Electronic health record (EHR)

When the patient’s condition is assessed by a physician (be it a first appointment or a follow-up), the information gathered and recorded in the EHR includes:

Personal data:

  • Sex
  • Age
  • Occupation
  • Date of birth
  • Marital status
  • Insurance
  • Contacts and more

Health data:

  • Chief complaint (CC)
  • Symptoms (frequent urination, skin rash, stomachache, cough, etc.)
  • Comorbidities, traumas
  • Vitals (temperature, pulse rate, respiration rate, blood pressure, and more depending on symptoms)
  • Lab results (if available, including blood, urine, other body fluids tests)
  • Lifestyle choices (physical activity, habits, nutrition, etc.)
  • Diagnosis (primary, secondary)
  • Treatment (medications, procedures, etc.)
  • Childhood diseases
  • Family diseases
  • Allergies

Transaction data:

  • Claims and other billing records (if available)

Patient-generated health data (PGHD)

PGHD is the type of data provided by the patient or their family members. Currently, the general way to gather and share PGHD is by using portable medical devices / smart wearables and mHealth apps. Patients can use CGMs, smart watches, insulin pumps, fitness trackers, oximeters, Holter monitors, hand-held capnographs, their smartphones and other gadgets to capture data. Then this information is synchronized with particular mHealth apps and can be sent to a caregiver.

PGHD can be objective or subjective. Objective data includes weight, heart rate and activity, blood pressure, blood glucose, temperature, oximetry results and more. Subjective data covers the patient’s mood, sleep, pain, itching, etc.

The role of patient-generated health data is to supplement clinical data gathered during appointments, tests and procedures so that providers can create a more comprehensive picture of the patient’s health status. PGHD is especially valuable for chronic disease management and post-operative rehabilitation.

Laboratory results

We specifically highlight lab results as a separate data type because they create a decision point in diagnostics and treatment. A test’s description might include the following information:

Fluid or tissue:

  • Blood
  • Urine
  • Stool
  • Semen
  • Saliva
  • Sweat
  • Amniotic fluid
  • Pleural fluid
  • Exudate
  • Transudate and more

Type of scale:

  • Quantitative
  • Ordinal
  • Nominal
  • Narrative


  • Mass
  • Volume
  • Chemical components (blood glucose, electrolytes, enzymes, hormones, lipids, etc.)
  • Time stamp
  • Type of method (procedure used for the test)
  • Time aspect (interval of time for observation or measurement) and more

Medical imaging

Another huge, complex and distinct data type is medical imaging, where visual information varies across modalities and can be presented in 2D or 3D formats:


  • Radiography
  • MRI
  • Ultrasound
  • CT
  • Fluoroscopy
  • Mammography
  • Angiography

Nuclear medicine:

  • PET

Optical imaging:

  • OCT

Moreover, quantitative imaging biomarkers, or QIBs, can be extracted within most of the modalities above. QIBs reflect underlying physiological or biophysical processes on medical images and can help in diagnosing, staging, treating and monitoring a wide range of diseases. While quantitative imaging biomarkers aren’t yet widely adopted in either research or clinical settings, multiple studies support their value for non-invasive patient screening.

Inpatient health monitoring

Inpatient health monitoring implies continuous data accumulation and provides critical information for such care areas as Anesthesia, PACU, NICU, Critical Care and Emergency Care. Specific monitoring technologies are involved in the process, and they make it possible to measure and track a variety of vitals, including:

  • Pulse rate
  • Pulse oximetry
  • Metabolic and gas exchange
  • Temperature
  • Total hemoglobin
  • Arrhythmia analysis
  • Cardiac status
  • Anesthesia parameters, such as Entropy and NMT

Why analyze health data

Clinical decision support

Here, providers have two options, which can be used separately or together – evidence-based medicine and diagnosis support. Evidence-based medicine is driven by insights extracted from health data (mostly diagnosis, procedure and treatment) combined with a knowledge base with similar cases, and used to find the most fitting treatment for each patient as well as to predict and avoid possible exacerbation, complication and readmission risks.

Diagnosis support, in turn, processes symptoms, lab results and patient history details to suggest possible conditions and procedures to confirm the disease, which assists in achieving timely treatment, balanced length of stay and positive health outcomes.

Safeguarding clinical trials

Patient health data can be used to analyze existing clinical trials in order to improve trial design and the search for eligible patients. Providers can better match a prospective treatment with fitting patients, reducing trial failures and negative health outcomes.

Workflow improvement

Quality assurance teams harness health data analytics to evaluate performance, understand clinical processes better and identify bottlenecks in care quality. They use information about procedures, primary and secondary diagnoses as well as lab tests to initiate process improvement activities, then monitor ongoing initiatives and their efficiency to ensure sustainable changes. For example, an increased number of C-sections can be a simple coincidence, with every procedure completely justified, or it can point to unnecessary procedures. Providers can analyze data on each patient’s indication for a C-section and find out whether there’s a need to intervene and revise the process to avoid unnecessary procedures.

Inpatient alerting

Alerting caregivers about changes in patient health status is critical in the inpatient setting and in the care areas listed in the inpatient monitoring section. The systems acquiring vitals continuously analyze inbound data and warn health specialists about negative and positive trends, critical declines and peaks, so that surgery, post-surgery recovery or any other rehabilitation process goes smoothly.
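
As a minimal sketch of how such alerting could work, the snippet below checks a sequence of pulse oximetry readings for critical values and for a downward trend. The thresholds, the window size and the function name `check_spo2` are assumptions made for illustration; they are not clinical guidance and do not describe any particular monitoring product.

```python
from collections import deque
from typing import Deque, List

# Hypothetical thresholds chosen for illustration only, not clinical guidance.
CRITICAL_SPO2 = 90   # alert immediately below this value (percent)
TREND_WINDOW = 5     # number of most recent readings to inspect
TREND_DROP = 3       # alert if SpO2 fell by at least this much across the window


def check_spo2(readings: List[int]) -> List[str]:
    """Return alert messages for a sequence of SpO2 readings (percent)."""
    alerts: List[str] = []
    window: Deque[int] = deque(maxlen=TREND_WINDOW)
    for i, value in enumerate(readings):
        window.append(value)
        if value < CRITICAL_SPO2:
            # Critical decline: a single reading is enough to warn.
            alerts.append(f"reading {i}: critical SpO2 {value}%")
        elif len(window) == TREND_WINDOW and window[0] - window[-1] >= TREND_DROP:
            # Negative trend: compare the oldest and newest reading in the window.
            alerts.append(f"reading {i}: downward trend {window[0]}% -> {window[-1]}%")
    return alerts


if __name__ == "__main__":
    for message in check_spo2([98, 98, 97, 96, 95, 94, 89]):
        print(message)
```

A real system would work on a live data stream and combine several vitals, but the two checks (a hard limit and a trend over a sliding window) mirror the "critical declines" and "negative trends" mentioned above.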

Fraud prevention

Healthcare organizations can reduce improper billing and avoid erroneous or fraudulent claims on a pre-adjudication basis, without risking their reputation and finances. To achieve that, transaction data with claims and billing records is analyzed to find patterns indicating fraudulent activity or other irregularities that result in waste and abuse. According to Mike Cottle et al. (Transforming Health Care through Big Data), CMS achieved $4 billion in recoveries thanks to transaction data analytics with fraud detection capabilities.
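
One of the simplest irregularity patterns is the same service billed more than once. The sketch below flags such duplicates before adjudication. The record layout, the field names and the function `find_duplicate_claims` are hypothetical; production fraud detection relies on far richer rules and models.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical claim records; field names and values are made up for illustration.
claims = [
    {"claim_id": "C1", "patient_id": "P1", "procedure": "PROC-A", "date": "2017-03-01"},
    {"claim_id": "C2", "patient_id": "P2", "procedure": "PROC-A", "date": "2017-03-01"},
    {"claim_id": "C3", "patient_id": "P1", "procedure": "PROC-A", "date": "2017-03-01"},
    {"claim_id": "C4", "patient_id": "P1", "procedure": "PROC-B", "date": "2017-03-02"},
]


def find_duplicate_claims(records: List[Dict[str, str]]) -> List[List[str]]:
    """Group claims by (patient, procedure, date) and return groups billed more than once."""
    groups: Dict[Tuple[str, str, str], List[str]] = defaultdict(list)
    for record in records:
        key = (record["patient_id"], record["procedure"], record["date"])
        groups[key].append(record["claim_id"])
    return [ids for ids in groups.values() if len(ids) > 1]


if __name__ == "__main__":
    print(find_duplicate_claims(claims))  # prints [['C1', 'C3']]
```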

Population health management

Data analytics can be used in multiple ways to benefit population health, but researchers concentrate on two dimensions – disease surveillance and chronic disease management.

Under disease surveillance, providers analyze diagnoses over time to detect disease outbreaks and ensure a speedy response to them.

Chronic disease management is one of the most important goals in population health, especially in terms of reducing hospital readmissions. Here, PGHD analytics contributes to the ongoing tracking of patients’ health status outside the hospital or clinic, allowing providers to initiate timely interventions and avoid exacerbations, complications and admissions.

Patient profiling

Researchers at McKinsey note that health data analytics is helpful in patient outreach. Providers can apply advanced analysis (such as segmentation and predictive modeling) to patient profiles and identify patients with high health risks, determine those in need of particular services/procedures, and use this actionable information to offer individuals proactive care options.

Health data analysis methods: Time-proven and prospective

Before we talk about emerging health data analytics methods, there are four baseline methods that allow caregivers to analyze clinical performance and outcomes through the prism of patient health information:

Descriptive analytics 

Descriptive analytics allows providers to focus on current clinical issues and look into the reasons for improved or decreased outcomes. For example, caregivers may analyze how many patients need a pneumococcal vaccine or the number of diabetes patients with blood glucose under control.
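
The diabetes example can be expressed as a simple aggregate. In the sketch below, the patient records, the field names and the HbA1c target of 7.0 are assumptions used only to show the calculation; the actual definition of "under control" is a clinical decision.

```python
from typing import Dict, List, Union

Patient = Dict[str, Union[str, bool, float]]

# Hypothetical patient records for illustration.
patients: List[Patient] = [
    {"id": "P1", "has_diabetes": True, "hba1c": 6.5},
    {"id": "P2", "has_diabetes": True, "hba1c": 8.1},
    {"id": "P3", "has_diabetes": False, "hba1c": 5.4},
    {"id": "P4", "has_diabetes": True, "hba1c": 6.9},
]


def share_under_control(records: List[Patient], target: float = 7.0) -> float:
    """Share of diabetes patients whose HbA1c is below the (assumed) target."""
    diabetic = [p for p in records if p["has_diabetes"]]
    if not diabetic:
        return 0.0
    controlled = [p for p in diabetic if p["hba1c"] < target]
    return len(controlled) / len(diabetic)


if __name__ == "__main__":
    # Two of the three diabetes patients are below the target.
    print(f"{share_under_control(patients):.0%}")  # prints 67%
```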

Predictive analytics 

The most frequent issue healthcare organizations have to solve within the value-based care approach is readmissions. Providers want the percentage of patients returning to the hospital to be as low as possible and use predictive analytics to estimate it. Moreover, caregivers can look into possible admissions and predict the length of stay according to patients’ health risks, habits, lifestyles, existing conditions and comorbidities. This data can help foresee emergency room utilization.

Prescriptive analytics 

Prescriptive analytics helps caregivers measure and manage patient population health, for example by focusing on patients with obesity and diabetes and assessing their LDL levels or other measurements. WHO has multiple tools for prescriptive analytics and population health monitoring, e.g. calculators of child mortality, health disparities, HIV prevalence and more.

Comparative analytics 

Comparative analytics allows caregivers to compare the health outcomes of individual patients with similar diagnoses but different LOS, treatment, procedures and other health data.
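
A minimal sketch of such a comparison, assuming a list of hospital stays with hypothetical fields `diagnosis`, `treatment` and `los_days`: patients with the same diagnosis are grouped by treatment, and the average length of stay is compared across the groups.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Union

Stay = Dict[str, Union[str, int]]

# Hypothetical stays; diagnoses, treatments and numbers are made up.
stays: List[Stay] = [
    {"diagnosis": "pneumonia", "treatment": "A", "los_days": 4},
    {"diagnosis": "pneumonia", "treatment": "A", "los_days": 6},
    {"diagnosis": "pneumonia", "treatment": "B", "los_days": 3},
    {"diagnosis": "pneumonia", "treatment": "B", "los_days": 5},
    {"diagnosis": "asthma", "treatment": "A", "los_days": 2},
]


def mean_los_by_treatment(records: List[Stay], diagnosis: str) -> Dict[str, float]:
    """Average length of stay per treatment for patients with one diagnosis."""
    by_treatment: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        if record["diagnosis"] == diagnosis:
            by_treatment[record["treatment"]].append(record["los_days"])
    return {treatment: mean(days) for treatment, days in by_treatment.items()}


if __name__ == "__main__":
    print(mean_los_by_treatment(stays, "pneumonia"))
```

With so few records the difference between the groups says nothing by itself; a real comparison would also need to account for case mix and sample size.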

Time-proven analysis techniques

We can also define a number of effective techniques within the four-piece group of general health data analysis methods above:

  • Data mining. It helps to discover patterns and trends in patient health data, making it possible to identify the underlying processes leading to diseases, such as recurrent episodes of skin rash, stomachache, hypertension, etc.
  • Text mining. This technique allows finding patterns and trends indirectly, by extracting quantitative parameters from unstructured text data – such as EHR entries in free text.
  • Online analytical processing (OLAP). OLAP is a set of tools allowing providers to analyze data in multiple dimensions simultaneously because it deals with preaggregated datasets. Online analytical processing uses three types of operations:
    • Slice-and-dice to extract a data subset and view it from different angles (looking at the patient population in the month, location or facility dimension is slicing, whereas choosing a few dimensions together is dicing)
    • Drill-down to focus on additional details (e.g. diving deeper into a dimension, from a month to particular days, from a county to a town, etc.)
    • Roll-up, the opposite of drill-down, used to consolidate information (e.g. moving from days up to months or from towns up to counties). A related step is shifting the focus from one dimension to another: for example, when viewing the patient population by disease and location, providers can switch one of the chosen dimensions to time.
  • Ad hoc analytics. It is a technique for deeper analysis, relying on non-aggregated data. It can be performed in a sandbox away from other data warehouse activities.
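
The slice-and-dice, drill-down and roll-up operations described above can be sketched in plain Python over a tiny list of visits. The data, the dimension names and the helper functions `select` and `count_by` are hypothetical; an actual OLAP tool runs the same kind of queries against preaggregated cubes rather than raw rows.

```python
from collections import Counter
from typing import Dict, List

Row = Dict[str, str]

# Hypothetical fact table: one row per patient visit.
visits: List[Row] = [
    {"day": "2017-03-01", "county": "North", "town": "Town A", "disease": "flu"},
    {"day": "2017-03-01", "county": "North", "town": "Town B", "disease": "flu"},
    {"day": "2017-03-15", "county": "North", "town": "Town A", "disease": "asthma"},
    {"day": "2017-04-02", "county": "South", "town": "Town C", "disease": "flu"},
]


def with_month(row: Row) -> Row:
    """Derive the coarser 'month' level of the time dimension from 'day'."""
    return {**row, "month": row["day"][:7]}


def select(rows: List[Row], **criteria: str) -> List[Row]:
    """Keep rows matching every given dimension value."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]


def count_by(rows: List[Row], *dims: str) -> Counter:
    """Count visits grouped by the given dimensions (keys are tuples)."""
    return Counter(tuple(r[d] for d in dims) for r in rows)


rows = [with_month(r) for r in visits]

# Slice: fix a single dimension (month).
march = select(rows, month="2017-03")
# Dice: fix several dimensions together.
march_flu_north = select(rows, month="2017-03", county="North", disease="flu")
# Roll-up: consolidate days into months, towns into counties.
by_month = count_by(rows, "month")
by_county = count_by(rows, "county")
# Drill-down: go from the month level to particular days.
march_by_day = count_by(march, "day")

if __name__ == "__main__":
    print(by_month)
    print(march_by_day)
```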

Prospective health data analysis

The two interesting cases we stumbled upon focus on population health and precision medicine.

Social media magic forecasts disease outbreaks

In the population health realm, social media data is now considered patient behavioral data and can be used to predict disease outbreaks. The reason why it works is simple. During infectious disease outbreaks, official data from health organizations and reporting structures can be unavailable for weeks, hindering timely epidemiologic assessment. Social media, on the other hand, can spread the word in near real time.

According to researchers, trends in the volume of Twitter posts containing “cholera” or “#cholera” during the 2010 Haitian cholera outbreak correlated significantly in time with official cholera case data but were available up to 2 weeks earlier. Additionally, Google Flu Trends correlated with the influenza outbreak in 2012.
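
The idea of one signal leading another can be illustrated with a lagged correlation. The sketch below uses made-up weekly counts (not the Haitian outbreak data) and hypothetical function names; it shifts the official series against the social media series and reports the lag with the highest Pearson correlation.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def lagged_correlation(leading: Sequence[float], lagging: Sequence[float], lag: int) -> float:
    """Correlate leading[t] with lagging[t + lag]."""
    if lag == 0:
        return pearson(leading, lagging)
    return pearson(leading[:-lag], lagging[lag:])


def best_lag(leading: Sequence[float], lagging: Sequence[float], max_lag: int) -> int:
    """Lag (in time steps) at which the two series correlate most strongly."""
    return max(range(max_lag + 1), key=lambda k: lagged_correlation(leading, lagging, k))


# Made-up weekly counts: official cases repeat the post volume two weeks later.
posts = [1, 3, 8, 15, 9, 4, 2, 1, 1, 1]
cases = [0, 0, 1, 3, 8, 15, 9, 4, 2, 1]

if __name__ == "__main__":
    print(best_lag(posts, cases, max_lag=4))  # prints 2
```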

Genomic analytics elevates clinical decision-making

As for precision medicine, genomic analytics is slowly but surely becoming part of the regular care decision process. The size of a single human genome is about 3 GB, so processing and analyzing this data is resource-intensive in terms of time, budget and computational capabilities. Still, clinical researchers are working on ways to execute gene sequencing more efficiently and cost-effectively, because precision medicine is now heavily used in cancer research and treatment planning.

For example, a recent study by the UK National Health Service reports that a “46-gene hotspot cancer panel assay allowing multiple gene testing” from cancer biopsies (including melanoma and non-small-cell lung cancer) can find actionable mutations, for which targeted treatment is potentially appropriate, in 66% (71/107) and 39% (41/105) of melanoma and NSCLC patients, respectively.

Researchers state that more extensive tumor sequencing can expand the number of treatment options available to patients while supporting appropriate therapeutic decisions. Moreover, the study concludes that the panel assay can be a cost-effective option if, instead of single gene testing, two, three or more genes are to be examined.

Challenges in healthcare data analytics

Ever-changing evidence-based medicine

Health specialists are constantly updating the agreed-upon knowledge about patient vitals and care delivery, so the understanding of targets and of measuring and monitoring goals keeps changing. For example, the American Diabetes Association (ADA) releases updates to the Standards of Medical Care in Diabetes yearly, including new recommendations for prediabetes and hypoglycemia, as well as tweaks to diabetes nomenclature and care suggestions to reflect the latest clinical trial findings. Therefore, caregivers need data analytics solutions that can process patient health data in line with general industry updates.

Mixed data

Health data is anything but consistent, and not all data are created equal. Coming from different sources, facilities, systems and devices, information varies in quality. What’s more, the same piece of data within one system, say, an EHR, can be captured in different ways. A diagnosis, for example, can be documented in billing records, medical history, the patient’s problem list, the clinical narrative and more.

Accordingly, sometimes health specialists fill in fields with drop-down lists, sometimes they tick checkboxes, and sometimes they input information in free text (e.g., in the clinical narrative). This builds up a hoard of structured and unstructured data, which should be treated differently at first but standardized in the end. Providers need to keep these potential issues in mind and invest in a resourceful system able to extract findings from unstructured information and then analyze them in conjunction with structured values.
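
To illustrate what extracting findings from free text and combining them with structured values can look like, here is a small sketch. The regular expression, the note wording and the record fields are assumptions for illustration; real clinical text mining needs far more robust language processing.

```python
import re
from typing import Dict, Optional, Tuple

# Matches e.g. "BP 142/91" or "Blood pressure was 120 / 80" (an assumed note style).
BP_PATTERN = re.compile(
    r"\b(?:BP|blood pressure)\D{0,15}?(\d{2,3})\s*/\s*(\d{2,3})\b",
    re.IGNORECASE,
)


def extract_blood_pressure(note: str) -> Optional[Tuple[int, int]]:
    """Return (systolic, diastolic) found in a free-text note, or None."""
    match = BP_PATTERN.search(note)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def enrich_record(record: Dict[str, Optional[int]], note: str) -> Dict[str, Optional[int]]:
    """Fill missing structured blood pressure fields from the note, if present."""
    enriched = dict(record)
    reading = extract_blood_pressure(note)
    if reading is not None and enriched.get("systolic") is None:
        enriched["systolic"], enriched["diastolic"] = reading
    return enriched


if __name__ == "__main__":
    note = "Pt reports headache. BP 142/91 mmHg, pulse 78."
    print(enrich_record({"systolic": None, "diastolic": None}, note))
```

Once the value sits in the same fields as readings entered through structured forms, it can be analyzed together with them.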

Emerging regulatory requirements

Regulatory requirements and reporting standards will also increase and evolve, especially as the pay-for-performance approach gets broader adoption over time and entails higher transparency in quality and pricing information. Accordingly, CMS will look into updating certain benchmarks and incentives, and some of them will eventually be reworked. These changes only add to the reporting and analysis burden for healthcare organizations, from ACOs to private practices.

Dealing with frustration from health data analytics

Leonard D’Avolio, an assistant professor at Harvard Medical School, said of Big Data technologies in healthcare that “for decades now, clinicians’, as well as healthcare administrators' frustrations with the way we have approached measurement and digital technology, are entirely justified.”

Well, being frustrated is okay if this feeling doesn’t stop you from moving further. The ways that health data analytics evolves with new sources and methods featuring PGHD, behavioral and genetic data are showing that health data analysis isn’t in stasis. It not only relies on time-proven techniques but discovers additional areas for improvements in evidence-based care, reporting, population health management and other dimensions.
