Stop, in the Name of Accuracy! An Introduction to Data Validation

Mistakes happen in the course of data entry. A research coordinator, intending to input a weight of 80 kilograms, leaves the field before striking the “0” key. Her colleague, completing a field for temperature, enters a value of 98, forgetting that the system expects the measurement in Celsius. But no adult enrolled in a clinical study weighs 8 kilograms. And the patient with a body temp of 98 degrees Celsius? “Fever” is something of an understatement.

Left standing, errors like the ones above distort analysis. That’s why data managers spend so much time reviewing submitted data for reasonableness and consistency. What if it were possible to guard against introducing error in the first place? With electronic forms, it is possible.

“Edit checks,” sometimes called “constraints” or “validation,” automatically compare inputted values with criteria set by the form builder.  The criteria may be a set of numerical limits, logical conditions, or a combination of the two. If the inputted value violates any part of the criteria, a warning appears, stating why the input has failed and guiding the user toward a resolution (without leading her toward any particular replacement).

Edit checks may be simple or complex; evaluate a single item or a group of related items; prevent the user from moving on or simply raise a flag. You can learn all about these differences below. The goals of edit checks are universal: higher data quality right from the start!

Check yourself

Setting edit checks appropriately is all about balance. Place too many checks, or impose ranges that are especially narrow, and you’ll end up raising alarms for a lot of data that’s perfectly valid. That will slow down research coordinators who simply want to get you the data you need. Place too few checks, or allow any old values, and you’ll open the gates to a flood of nonsensical data. You or a data manager colleague will then need to address this data with the clinical site after it’s submitted. Wait too long, and you could discover that the site can’t determine what led to the error in the first place.

While there’s no exact formula for striking the right balance, there are guidelines. Any value that could signal a safety issue ought to receive a lot of scrutiny. For example, in a study investigating a compound known to impact kidney function, you’ll want to place careful constraints around an item asking for a glomerular filtration rate. The same goes for measures tied to eligibility or constitutive of primary endpoints. On the other hand, it doesn’t make sense to enforce a value for height that’s within 10% of a population mean. Moderately short and tall people enroll in studies, too!

Variety is the spice of edit checks

All edit checks share the common objective of cleaner data at the point of entry. They also share a rigorous and logical method. Input is either valid or not, and the determination is always objective. Beyond this family resemblance, though, edit checks differ in their scope and effects.

Hard vs. soft

Hard edit checks prevent the user from proceeding to the next item or item group. Note that a validated system will never expunge a value once submitted, even if it violates a hard check. Rather, it will automatically append a query to the invalid data. Until the query is resolved, the form user won’t be able to advance any further on the form.

Soft edit checks, by contrast, allow the user to continue through the form. However, the user won’t be able to mark the form complete until the query attached to the check is resolved.

Hard and soft edit checks each have their place. If an out of range value would preclude further study activities, a hard edit check may be justified, as it sends a conspicuous “stop and reassess” message to the clinical research coordinator. Where an invalid piece of data is likely to represent a typo or misunderstanding (e.g. a height of 6 meters as opposed to 6 feet entered on a physical exam form), a soft edit check is preferable.

Univariate vs. multivariate

Univariate edit checks evaluate input against range or logical constraints for a single item. For example: the value for Height, in inches, must be between 48 and 84.

Multivariate edit checks, by contrast, place constraints on the data inputted for two or more fields. “If, then” expressions often power these checks: if field A is selected, or holds a value within this range, then field B must meet some related set of criteria. If a form user indicates a history of cancer for a study participant, a related field asking for a diagnosis will fire its edit check if a cancer diagnosis isn’t provided.
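To make the distinction concrete, here is a minimal sketch in Python. The function names, messages, and field values are invented for illustration; real edit checks would be written in your EDC’s own rule syntax.

```python
# Illustrative sketch of the two kinds of edit check (not any EDC's real syntax).
# Each function returns None for valid input, or a warning message otherwise.

def check_height(height_in):
    """Univariate: height in inches must fall between 48 and 84."""
    if 48 <= height_in <= 84:
        return None  # valid: no query raised
    return "Height must be between 48 and 84 inches."

def check_cancer_history(has_cancer_history, diagnosis):
    """Multivariate: a 'Yes' for history of cancer requires a diagnosis."""
    if has_cancer_history and not diagnosis:
        return "A cancer diagnosis is required when history of cancer is 'Yes'."
    return None  # valid
```

Notice that the warning states why the input failed without suggesting a particular replacement, in keeping with the guidance above.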

When input fails to meet a multivariate edit check, it’s important for the warning message to state which item values are responsible for the conflict. Suppose a research coordinator enters “ovarian cyst” on a medical history form for a participant previously identified as male. A well-composed error message on the medical history item will refer the user to the field for sex.

Standard vs. protocol-specific

Standard edit checks, such as those placed on items for routine vital signs, do not vary from study to study. Their value lies in their re-usability. Consider a check placed on the item for body temperature within a Visit 1 Vital Signs form; one, say, that sets a range between 96 and 100 degrees Fahrenheit. That check can follow that item from form to form, just as the form may be able to follow the Visit 1 event from study to study. There are no experimental reasons to set a range wider or narrower than this commonly expected one.

A protocol-specific edit, by contrast, enforces on an item a limit or threshold dictated by the protocol. Imagine a study to determine the reliability of a diagnostic tool for prostate cancer in men at least 50 years old. The eligibility form for such a study will likely include protocol-specific edit checks on the items for participant sex and date of birth. Or consider an infectious disease study whose patient population requires careful monitoring of their ALT value. In this context, a value that’s just slightly above normal may indicate an adverse event, so the acceptable range would be set a lot narrower than it would be for, say, an ophthalmological study.

Query, query (when data’s contrary)

A research coordinator who enters invalid data may not know how to correct their input, even with the guidance of the warning message. Or their input may be perfectly accurate and intended, while still falling outside the range encoded by the edit check. In these cases, your EDC should generate a query on the item. Queries are virtual “red flags” that attend any piece of data that either:

  • fails to meet the item’s edit check criteria
  • raises questions for the data manager or data reviewer

The first kind of query, called an “auto-query,” arises from programming. The system itself flags the invalid data and adds it to the log of queries that must be resolved before the database can be considered locked. The second kind of query, called a “manual query,” starts when a human, possessing contextual knowledge the system lacks, indicates her skepticism concerning a value. Like auto-queries, manual queries must be resolved before the database can be locked.

To resolve or “close” an auto-query, the user who first entered the invalid data (or another study team member at the clinical site) must either:

  • submit an updated value that meets the edit check criteria
  • communicate to the data manager that the flagged data is indeed accurate, and should stand

The data manager may close a query on data that violates an edit check. In these cases, she is overriding the demands of the validation logic, but only after careful consideration and consultation with the site.

To resolve a manual query, the site user and data manager engage in a virtual back and forth (sometimes short, sometimes long) to corroborate the original value or arrive at a correction. A validated EDC will log each question posed and answered during this exchange, so that it’s possible to reconstruct when and why the value under consideration changed as a result.

Resolving a query isn’t just a matter of removing the red flag. If the data manager accepts the out of range value, she must indicate why. If the research coordinator inputs a corrected value, she too must supply a reason for the change as part of the query back and forth. The goal is to arrive at the truth, not “whatever fits.”


In praise of skip logic


“Please answer the following three questions if you answered ‘yes’ above.”

Instructions like the above are common on paper forms. A biological male, for example, won’t have a history of pregnancy. Asking him questions on this topic wastes his time, contributes to survey fatigue, and makes him more likely to abandon the form. When a user is forced to chart their own way through a form, the chances of missing a critical question increase. Meanwhile, some portions of your forms will be destined for irrelevance, which is a waste of paper and an encumbrance on the data compiler responsible for sifting through completed and skipped boxes.

Enter electronic case report forms with skip logic. As with scores and calculations, skip logic takes advantage of the digital computer’s biggest strength: its ability to compute. Here, instead of summing or dividing values, the form evaluates the truth of a logical expression. Did the user either select ‘Yes’ for item one or check the box for ‘B’ on item two? Did she include a cancer diagnosis on last month’s medical history form? (Variables in skip conditions can range over multiple forms.) Form logic can deduce the answer instantly, and behave differently depending on what that answer is: showing or hiding an additional item, further instructing the coordinator or participant, or alerting the safety officer. The results? A better user experience, cleaner data capture, and no wasted screen space!
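As a rough illustration (with invented field names, not any particular EDC’s syntax), skip logic amounts to evaluating a Boolean condition over earlier answers and showing or hiding items accordingly:

```python
# Hedged sketch of skip logic: each function decides whether a follow-up
# item should be shown, based on answers captured so far.

def show_pregnancy_history(answers):
    """Show the pregnancy-history block only for participants recorded as female."""
    return answers.get("sex") == "female"

def show_followup(answers):
    """Show a follow-up item if the user selected 'Yes' on item one
    OR checked the box for 'B' on item two."""
    return answers.get("item1") == "yes" or "B" in answers.get("item2", set())
```

A form engine re-evaluates conditions like these on every input, so the visible items always match the participant’s situation.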

Combating survey fatigue

Even the most diligent research coordinator can find herself procrastinating when it comes to data entry, only to battle frustration when she does finally attempt to tackle the study’s eCRF. The same is true of patients, who often face the additional challenge of poor health. This resistance to starting and completing forms is called survey fatigue, and it’s very real. Survey fatigue slows the pace of data acquisition and undermines quality, as respondents consciously or subconsciously overlook items or supply their “best guesses” simply to finish the work. As data managers and form builders, we need to consider the respondent’s experience at all times. This includes asking only for information required for analysis and possessed by the researcher or study participant. Never ask these respondents to perform a task more ably performed through form logic and calculation. That includes applying skip logic to ensure that all of the questions we ask are relevant!

Let’s get logical!

Skip logic (also known as branching logic, conditional logic, or skip patterns) is the technical name for a common type of command:

If this is true, then do that; otherwise, do this other thing.

What this refers to may be simple; for example, the user selecting ‘yes’ on a yes or no question. Alternatively, this might be quite complex; for example, the user selecting both B and C, but not D, on item 10 or else responding ‘no’ to at least five items between items 11 through 20.  This is always either true or false, depending on whether the input conforms to an expression the data manager has written with Boolean algebra.
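The complex condition described above can be written out as a single Boolean expression. Here is one way to sketch it in Python (the item names are placeholders):

```python
# "The user selected both B and C, but not D, on item 10, OR responded 'no'
# to at least five items between items 11 and 20."

def condition(item10_selected, items_11_to_20):
    clause1 = ("B" in item10_selected and "C" in item10_selected
               and "D" not in item10_selected)
    clause2 = sum(1 for ans in items_11_to_20 if ans == "no") >= 5
    return clause1 or clause2
```

However intricate the clauses become, the whole expression still evaluates to exactly one of true or false, which is what makes the branching unambiguous.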

Because this is either true or false (and not both), either that or the other thing must occur (and not both). As with the conditions for this, it’s up to the form builder to state what that and the other thing are. Usually, she will try to model the protocol with this conditional command. For example: “Take this measurement. If it falls within this range, take this additional measurement. Otherwise, proceed to the next assessment.”

The form below provides examples of skip logic with increasing complexity. See if you can recognize the conditional command behind each one.


Boolean algebra: the rules at the heart of skip logic

At the foundation of all digital electronics lie three elementary ideas: AND, OR, and NOT. These are the basic operators of Boolean algebra. But they’re not just for circuit design. Facility with these operators is a must for anyone who wants to design a form that’s truly responsive to input. If you’re new to these concepts, check out this helpful video from PBS LearningMedia. From there, you can learn how to evaluate a Boolean expression for different inputs using truth tables. Finally, you’ll be able to write your own Boolean expressions, even very complex ones, to “program” your form to behave in certain ways based on certain inputs.
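If you’d like to experiment before reaching for a video, a few lines of code can generate a truth table for any expression. The expression below, (A AND B) OR (NOT C), is just an arbitrary example:

```python
# Build a truth table by enumerating every combination of inputs.
from itertools import product

def expr(a, b, c):
    """An example Boolean expression: (A AND B) OR (NOT C)."""
    return (a and b) or (not c)

table = [(a, b, c, expr(a, b, c)) for a, b, c in product([False, True], repeat=3)]
for a, b, c, result in table:
    print(a, b, c, "->", result)
```

Three variables yield eight rows; each added variable doubles the table, which is why writing the expression (rather than listing cases) scales so much better.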

Thoughts on eligibility forms in the Journal for Clinical Studies

Eligibility is more than a checklist. It’s an essential instrument for patient safety and study integrity.  Recently, the Journal for Clinical Studies invited OpenClinica to share recommendations for eligibility forms that could best deliver that safety and integrity. We were honored to contribute to their most recent issue. Take a read and let us know what you think!


What’s the score? Real-time scoring of the PHQ-9


Heart disease. Lung cancer. Type II diabetes.

You. Me. The barista at the coffee shop.

For all the differences between the diseases above, each presents, if it does present, with a certain severity. For all the varied experiences of the people above, each bears some risk of developing the diseases. How do we evaluate that severity? How do we measure the risk? The answer is with a score.

What is a score?

A score is a value on an ordinal scale, used to classify the severity of a condition or to predict its future course. (Only a rigorous validation study can establish if and how well the score predicts.)  Instruments for generating scores take more basic measures like weight, blood pressure, or the presence or absence of some biomarker as their inputs, then combine these inputs in mathematically explicit ways. Crucially, a given score is calculated the same way from setting to setting and study to study, thus endowing the score with universal meaning.

Why are scores useful?

Scores characterize, classify, and predict. In cases of trauma or disease, a score (or a stage, or a grade) is what makes prognoses and treatment decisions possible. With scores so essential to clinical practice, it’s hardly surprising we encounter them so frequently in research. Eligibility criteria may set bounds to acceptable scores, to ensure safety or to tailor the investigation to a particular patient profile. A change in score over time may represent a primary outcome, suggesting a therapy’s superiority or inferiority to some comparator in reducing a disease burden or improving quality of life.

Scores are not comprehensive descriptions of a patient’s disease, much less of the patient herself. They are never perfectly predictive.  They are quantitative heuristics whose success in classifying the stage or severity of a disease, or in predicting the risk of its development, has been established through statistical studies.

How do researchers calculate scores?

Most (not all) scores are matters of simple arithmetic. Measure A, B, and C, then add them together. If D is present, add one. If D is absent, subtract one. A researcher mentally calculating a score for one patient in a calm, quiet setting stands a good chance of doing so correctly. But as inputs grow larger (34 x 72, say, instead of 4 + 9), so does the chance of a miscalculation. Asked to perform mental calculations again and again, for dozens of study participants, the researcher is all but guaranteed to make at least one mistake.
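The arithmetic described above is trivial for a form engine. As a hedged sketch (the measures and weights here are invented, not any published instrument):

```python
# Sketch of a simple additive score: sum measures A, B, and C, then
# add one if finding D is present, otherwise subtract one.

def example_score(a, b, c, d_present):
    score = a + b + c
    score += 1 if d_present else -1
    return score
```

The point is not the formula itself but that the form applies it identically for participant one and participant one hundred, with no mental arithmetic involved.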

How does EDC make working with scores easier?

When it comes to computation, it’s hard to beat, well, a computer. Collecting data electronically facilitates rapid, accurate operations on that data. When data capture is web-based, the calculations may be shuttled between the clinical site and data management office almost instantly. That rapid exchange optimizes every stage of trial conduct. Real-time scoring at screening visits can stratify participants into cohorts. Scores that signal an adverse event can immediately trigger workflows for stabilizing the participant and submitting safety reports. When the time comes to analyze results, a portion of the statistical labor is already done.

Can you give me an example?

I was hoping you’d ask. Below you’ll find two presentations of the Patient Health Questionnaire-9 (PHQ-9), an instrument for screening, diagnosing, monitoring and measuring the severity of depression. The first presentation assumes out-of-clinic, ePRO use, where study managers expect most participants to respond on their own smartphone. The form will render on any web-enabled device, but the pagination is set to display one item at a time, for ease-of-use on smaller screens. The second presentation assumes in-clinic use on a tablet. Depending on the protocol, the researcher may administer the PHQ-9 through an interview with the participant, or the participant may complete the questionnaire on her own.  In both cases, best practices in ePRO form design dictated layout and behavior.

OpenClinica forms support real-time scoring through syntax any data manager can learn quickly. Data managers have complete control over when and if the score is displayed on the form, rather than simply saved to the database. Scores may be built into forms for patient-reported outcomes (ePRO) as easily as they are for clinic visit forms. And conditional logic can trigger additional questions or other workflows based on specific scores.
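For a sense of the logic a form performs behind the scenes, here is a sketch of PHQ-9 scoring in Python (a plain illustration of the instrument’s published scoring rules, not OpenClinica’s form syntax):

```python
# PHQ-9: nine items, each scored 0-3, giving a total of 0-27.
# The total maps to the instrument's published severity bands.

def phq9_total(items):
    assert len(items) == 9 and all(0 <= i <= 3 for i in items)
    return sum(items)

def phq9_severity(total):
    if total <= 4:
        return "minimal"
    if total <= 9:
        return "mild"
    if total <= 14:
        return "moderate"
    if total <= 19:
        return "moderately severe"
    return "severe"
```

A form computing this in real time can, for instance, trigger a safety workflow the moment a total crosses a protocol-defined threshold.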

So give your study’s major eClinical system a chance to do what it does best, and score one for speed and accuracy.


Click here to view a tablet-based presentation of the PHQ-9.

Need a Glomerular Filtration Rate? Let the form do the math!

When it comes to math, the modern eCRF is no slouch. You’ve seen OpenClinica forms add, subtract, multiply, and divide values at lightning speed. But these operations barely scratch the surface.

OC4’s form engine supports a wide array of mathematical functions, from arcsines to exponents, all defined by clear, human-readable expressions. What does this mean for your clinical trial? As the data manager, you can help your site users derive complex measures, like the one below, with a consistent, error-free method. Meanwhile, site users never need to launch their calculator app or recall the right formula to apply.

The example below relies on a combination of “if then” logic and calculation to (1) identify the formula relevant to the participant’s race, sex, and serum creatinine, and (2) apply the formula to the input supplied by the user. The form completes both of these tasks in milliseconds. Give it a try!
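The form’s own expression isn’t reproduced here, but as one illustration of this kind of branching calculation, here is a sketch based on the 4-variable MDRD study equation for estimated GFR. Treat the coefficients as illustrative; a production form should encode exactly the formula the protocol specifies.

```python
# Sketch of an "if then" + calculation pattern: the 4-variable MDRD study
# equation for estimated GFR, with adjustment factors applied conditionally.

def egfr_mdrd(serum_creatinine_mg_dl, age_years, female, black):
    egfr = 175 * (serum_creatinine_mg_dl ** -1.154) * (age_years ** -0.203)
    if female:
        egfr *= 0.742  # adjustment factor for female participants
    if black:
        egfr *= 1.212  # adjustment factor used by this equation
    return round(egfr, 1)
```

The branching and the exponentiation both happen in milliseconds, which is the whole point: the site user supplies inputs, and the form does the math.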

Did you know? Glomerular filtration rate (GFR) is an estimate of how much blood passes through the glomeruli of the kidneys each minute. It is an indicator of kidney function. For adults between the ages of 20 and 59, a GFR below 90 mL/min/1.73m² may suggest kidney disease, depending on whether other signs of kidney damage (e.g. protein in urine) are present. (Source: “Glomerular Filtration Rate: A Key to Understanding How Well Your Kidneys Are Working” from the National Kidney Foundation)

You can find an authoritative compendium of measures like this at MD+CALC, all of which can be implemented in OpenClinica forms.



Customers, download the form here. Requires sign in.


Would you like to see a different measure in an OpenClinica form? Leave a comment below!

Save the Date: November 7 and 8 in Santander, Spain

Mark your calendars! This year’s annual gathering will take place on Thursday, November 7 and Friday, November 8 in Santander, Spain. Super User training will be offered from Monday, November 4 through Wednesday, November 6. (For those mostly or entirely unfamiliar with OC4, Super User Training is an effective way to master the fundamentals of our solution before diving into the advanced use cases we’ll cover on Thursday and Friday.)

This year, it’s all about discovery and doing. We’ll spend our time together working directly in OC4: creating studies, building forms, and becoming familiar with the dozens of new features and enhancements that continue to make our current solution the solution data managers can rely on for performance, flexibility, and security.

Details are still coming together. Here are the basics:

  • Anyone wishing to take part in OC19 will be able to do so in person or online.
  • Registrants will receive access to an OC4 sandbox study in advance of the conference.

Interested in a special use case or how-to? Email

The Four Criteria of a Perfect Eligibility Form: A Success Guide

Looking for another success guide? See our guides on cross form intelligence, date formatting, ePRO form design, and site performance reporting.

In the months ahead, Journal for Clinical Studies will publish a detailed guide to designing eligibility forms–a guide authored by OpenClinica! The complete contents are embargoed until they appear (for free) on the journal’s website. As soon as it’s published, we’ll provide a link to it here. In the meantime, here’s a brief excerpt and an interactive form illustrating one of the guide’s four core principles, “Make your forms carry out the logic.”

from The Four Criteria of a Perfect Eligibility Form: A Success Guide, forthcoming in Journal for Clinical Studies

Think a moment about the human brain. Specifically, think about its capacity to carry out any logical deduction without flaw, time and again, against a background of distractions, and even urgent medical issues.

It doesn’t have the best track record.

Even the most logical research coordinator could benefit from an aid that parses all of the and’s, or’s, and not’s scattered throughout your study’s eligibility. A good form serves as that aid. Consider the following inclusion criteria, taken from a protocol published on

Inclusion criterion #1 is straightforward enough. (Although even there, two criteria are compounded into one.) By contrast, there are countless ways of meeting, or missing, criterion #2. It’s easy to imagine a busy CRC mistaking some combination of metformin dose and A1C level as qualifying, when in fact it isn’t.

But computing devices don’t make these sorts of errors. All the software needs from a data manager is the right logical expression (e.g., criterion #2 is met if and only if A and B are both true, OR C and D are both true, etc.). Once that’s in place, the CRC can depend on your form to deliver perfect judgment every time. Best of all, that statement can live under the surface of your form. All the CRC needs to do is provide the input that corresponds to A, B, C, and D. The form then works through the logic instantly, invisibly, and flawlessly.
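In code, the whole idea fits on one line. A, B, C, and D are placeholders for whatever inputs the protocol defines:

```python
# "Criterion #2 is met if and only if A and B are both true,
# OR C and D are both true."

def criterion_2_met(a, b, c, d):
    return (a and b) or (c and d)
```

However the CRC arrives at the four inputs, the form applies this expression the same way every time.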

Test drive the form below to see a smart eligibility form in action. OpenClinica customers, be sure to visit the Eligibility section of the CRF library to download the form definition.

For more on designing forms that capture better data, faster, view our on-demand webinars from December 2018.

Turning the tables on patient-specific reference data

How much time do you have left?

Yes, in that sense. The existential one.

If the question is difficult to ask, it’s even harder to answer. Ask an actuary. Calculating life expectancy is a complex matter; more complex, at least, than plugging your date of birth and today’s date into a function. An informative life expectancy depends on a host of additional factors, like your sex, current health, and lifestyle habits.

“Multifactorial” calculations like the one above dominate medicine, so it’s no surprise that they should dominate clinical research, too. Take a plasma urea level of 39 mg/dL. Is that above, below, or within the normal range? The question is misconceived, because normal in this case is relative to patient age. A 30-year-old’s “slightly above normal” is a sixty-year-old’s “slightly below normal.”

Age is only one factor. For many ranges, patient gender, ethnicity, and co-morbidities, in addition to age, determine a normal range. Often, researchers can set these factors aside without raising undue safety concerns or undermining the generalizability of their results. But as personalized medicine continues to inform drug discovery and clinical care, researchers will turn to more finely-grained reference data more often. For this reason, data management systems must make it easy for these researchers to apply reference data that’s sensitive to as many factors as they choose.

Of course “easy,” just like “slightly below normal,” is a relative term, for the most part. In no context is writing a lengthy formula of nested “if, then” clauses easy, e.g.

If the participant is male and Hispanic and between 18 and 25 years old and the test is for ALT, then set the lower limit to 12 U/L and the upper limit to 102 U/L, and if the participant is male and Hispanic and between 26 and 34 years old and the test is for ALT, then set the lower limit to…

Completing the formula above would mean assigning a lower and upper bound to every combination of gender, ethnicity, and age range. The process could easily take hours, just to set the normal limits of ALT. If the study involved a dozen analytes, the data manager would need to devote the better part of a week to programming these constraints. If, at a later date, any one of those constraints changed, he or she would face the unenviable task of modifying (without breaking) the original formula. Too many “modern” EDC systems force the data manager to soldier through this error-prone task. With paper, it’s a non-starter.

How much better, then, for efficiency and quality, to rely on a general constraint: one that leverages a tool that’s easy to build, easy to read, and easy to amend? I’m talking about the humble table.

Remember this table?

Yes, the table. For all our advancements in data architecture, the same grid that set us on the path to multiplication in second grade remains an asset today. It’s human readable, it’s intuitive, and it’s powerful.

Powerful? Really? How much can you accomplish with just two axes?

Great question! It’s true that most spreadsheet applications don’t offer more than two axes, at least not through their GUI. But who needs them when you have thousands of rows and hundreds of columns at your disposal?

Suppose I need to assign a unique value to every combination of three hand preferences (left, right, or ambidextrous), four eye colors (blue, green, brown, or hazel), and the eight blood types (O+, O-, A+, A-, B+, B-, AB+, AB-). At first blush, it seems a table won’t suffice. I have more dimensions (three) than I do axes (two). But a single axis can accommodate any number of dimensions, because nothing prevents me from treating each combination of values on those dimensions as its own, n-factored value. For example, I can treat each triad of handedness, eye color, and blood type as one of 96 phenotypes.
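The combinatorics here are easy to verify in a few lines. This sketch enumerates every triad and assigns each a unique whole number, just as the two-column table does:

```python
# Enumerate the 3 x 4 x 8 = 96 phenotype combinations and number them.
from itertools import product

handedness = ["left", "right", "ambidextrous"]
eye_colors = ["blue", "green", "brown", "hazel"]
blood_types = ["O+", "O-", "A+", "A-", "B+", "B-", "AB+", "AB-"]

phenotype_id = {combo: i + 1
                for i, combo in enumerate(product(handedness, eye_colors, blood_types))}
print(len(phenotype_id))  # 96 combinations
```

Laying these 96 keys down the vertical axis is exactly the single-axis trick described above.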

Two-column table: the first column lists every possible combination of handedness, blood type, and eye color; the second column assigns a unique whole number to each combination.


Laying these combinations along a vertical axis, I can assign a value to each with just two columns.

Maybe I’m partial to a more compact format. If so, I can combine the variables from two dimensions to specify one axis, and let the variables from the third dimension define the other:

Table with the vertical axis listing handedness and eye color combinations, and the horizontal axis listing blood types.


Here I make the 96 assignments with 13 rows and 9 columns. (The virtue of this method is fewer total cells.)

In any case, I’m free to work with as many factors as the situation demands, and distribute them between the two axes in any way that makes the most sense to me. Leaning on a familiar format, I’ve made the difficult part of a multifactorial reference much easier. All that remains is to add to the form a simple instruction for “looking up” the values needed. Even if those values change, the form doesn’t need to.

Fair enough. But won’t real use cases require gargantuan tables?

Sure. But what’s gargantuan to you and me is a walk around the block for the right technology. OpenClinica’s EDC relies on fast and flexible XForms to move data through a nimble, microservices architecture, so “clinically-sized tables” pose no threat to smooth performance. Consider these common parameters:

  • 81 ages (18 to 98 years old)
  • 6 ethnic and racial categories
  • 2 genders
  • 40 analytes
  • 2 limits of normal (one upper, one lower)

A mere 972 rows (plus one header) accommodate every combination of age, ethnic and racial category, and gender. 80 columns (plus one on the left for analyte names) accommodate the 40 lower and 40 upper limits. The resulting 973 x 81 grid is small potatoes for database applications that power software like OpenClinica’s. Simple formulas in that context can retrieve the value from any coordinate within milliseconds.

Great. But what’s the big deal? I hardly ever need to apply reference data for this many factors at once.

Yes, a heart rate is a heart rate, and while population differences might exist for this measure, they’re hardly a concern on your vitals form. But don’t confuse the frequency of a need with its importance. Take safety. An insignificant drop in a lab value for one patient may portend real danger for another. Even apart from lab interpretation, though, tables can drive efficiency and accuracy. Dosing can vary between countries participating in the same study, due to differences in labeling and regulation. The same goes for eligibility and arm allocation. Whenever we try to account for these variables within our form, we accept programming delays and chances for error that we don’t need to accept. It is possible, of course, to make an error when assembling our table, but those errors are easier to spot and correct within a grid than they are in some extended, conditional formula. The tables themselves are easier to build in the first place, too, as their source data usually comes to us in the form of a spreadsheet. A little re-labeling of our first row and column, some testing, and voilà: trusted reference values are now a part of our study.

The lesson is simple, then. First, make sure you’re using the right EDC. Your form builder should allow you to specify reference data with tables, and your forms themselves should retrieve values in that table based on user input all but instantly. Second, use your two axes to their full potential: fill those rows and columns with as many dimensions as are relevant by tapping some basic combinatorics. Third, congratulate yourself.

You’ve just used a bit of the time you have left more wisely.

Real-world example: applying lab reference data that’s gender- and age-specific for two analytes

Not every analyte carries with it age- or gender-specific normal ranges. But for those that do, their differences are critical. In this example, I’m concerned with two levels from a blood serum panel: Insulin-like growth factor 1 (IGF-1) and Dehydroepiandrosterone-sulfate (DHEA-S). Both play a key role in several endocrinological disorders, and both have normal ranges that vary by age and gender.

Our example form first asks the user to specify the patient’s sex, patient’s date of birth, and date of sample collection. The form then calculates the patient’s age, in years, at the time of collection.
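The age derivation just described can be sketched in a few lines of Python. This is an illustrative sketch, not any particular EDC’s implementation; the function name and sample dates are invented:

```python
from datetime import date

def age_at_collection(dob: date, collected: date) -> int:
    """Patient age, in whole years, on the date of sample collection."""
    years = collected.year - dob.year
    # Subtract a year if the birthday hasn't yet occurred in the collection year
    if (collected.month, collected.day) < (dob.month, dob.day):
        years -= 1
    return years

age_at_collection(date(1984, 5, 10), date(2024, 6, 1))  # 40
```

The month/day comparison is what keeps a patient from “aging” a year early when the sample is drawn before their birthday.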


Next, the user is prompted to enter the value for IGF-1.

As soon as it’s entered, the form compares that value to the upper and lower limits of normal corresponding to the patient’s age and sex, as found in the table below. Note that the user’s selection for gender and the calculated age combine to form a unique key (‘female40’).

The lower limit of normal (igf_ll) for a 40-year-old female is 106 ng/mL. The upper limit (igf_ul) is 267 ng/mL. Because the entered value of 145 falls within that range, no query is raised.

The form then prompts the user to enter a DHEA-S level. For this analyte, the user enters 278 µg/dL. That value is outside the range for a 40-year-old female. As a result, an auto-query instantly fires.
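A minimal sketch of this key-based range check, assuming a dictionary-backed reference table. The ‘female40’ IGF-1 limits come from the example above; the DHEA-S limits shown are placeholders chosen only so that the entered value of 278 falls out of range:

```python
# Reference table keyed on sex + age, as in the 'female40' example.
# IGF-1 limits are in ng/mL; the DHEA-S limits (ug/dL) are illustrative placeholders.
REFERENCE = {
    "female40": {"igf1": (106, 267), "dheas": (32, 240)},
}

def within_normal_limits(sex: str, age: int, analyte: str, value: float) -> bool:
    """Look up the row for this sex/age key and test the entered value."""
    lower, upper = REFERENCE[f"{sex}{age}"][analyte]
    return lower <= value <= upper

within_normal_limits("female", 40, "igf1", 145)   # True -> no query
within_normal_limits("female", 40, "dheas", 278)  # False -> auto-query fires
```

The point of the composite key is that one lookup retrieves every limit relevant to this patient, no conditional logic required.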

The full reference table includes 191 rows…

  • 1 header row
  • 95 rows for men aged 18 to 112
  • 95 rows for women aged 18 to 112

… and 5 columns…

  • 1 column for the gender-age combinations
  • 1 column for IGF-1 lower limit
  • 1 column for IGF-1 upper limit
  • 1 column for DHEA-S lower limit
  • 1 column for DHEA-S upper limit

Introducing racial and ethnic categories, along with more analytes, would multiply the area of our table. Six racial and ethnic categories combined with two genders and 95 whole-year ages would generate a total of 1,141 rows (6 x 2 x 95 combinations plus 1 header row). Specifying the upper and lower limits for three dozen analytes would occupy 73 columns (2 limits x 36 analytes + 1 label column). The resulting 1,141 x 73 table would contain 83,293 cells, a total that’s roughly 87 times greater than our original table’s cell count. Should you expect a proportional increase in your form’s response time? Not at all! The “lookup” still happens within milliseconds.
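The combinatorics are easy to verify in a few lines (a sketch; the counts follow directly from the row and column formulas):

```python
rows = 6 * 2 * 95 + 1      # race/ethnicity x gender x whole-year ages, plus header row
cols = 2 * 36 + 1          # lower and upper limits per analyte, plus label column
cells = rows * cols
original_cells = 191 * 5   # the IGF-1 / DHEA-S table described earlier
print(rows, cols, cells, round(cells / original_cells))  # 1141 73 83293 87
```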

Headed down registry road? Here are the EDC features you’ll need.

Here in Massachusetts, with the March winds whipping and snow always a threat, a week’s vacation down south is a common fantasy. Even if it means a 10-hour car ride, most of us relish the thought.

But suppose our usual set of wheels, a Mini Cooper, say, is in the shop. (Potholes the size of craters are a common reality here.) Instead of foregoing our vacation, we decide to rent a vehicle. Chances are another Mini Cooper won’t rank as our first choice. Sure, a car that size could get us from Boston to the Outer Banks. But at what cost to our comfort and cargo?

We can think of study designs as kinds of road trips, and our eClinical tools as vehicles. Randomized controlled trials (RCTs) and registry studies are only two such journeys, but they’re two of the most frequent we in the research community take. In both cases, most of us rely on electronic data capture (EDC) to help us reach our destination.

How do we choose the EDC “vehicle” that will get us there safely, with minimal delays? Marquee brand names matter less than road-tested features. Consider the relative importance of these EDC features in RCTs versus registries.


Automatic reporting and notification
  • RCTs: Important, especially as interim analyses approach
  • Registries: Very important, to maintain the desired balance among subgroup sizes and to ensure that sites contact participants at the appropriate intervals

Interoperability
  • RCTs: Important, especially for trials that need to consume a high volume of lab and imaging data on a regular basis
  • Registries: Very important, as EHR data can easily account for more than half of a registry’s data

Researcher ease-of-use
  • RCTs: Very important, to drive data entry timelines, reduce queries, and ensure quality
  • Registries: Critically important, for the reasons listed under RCTs, as well as to minimize collection burden and complement the flow of clinical care

Participant ease-of-use
  • RCTs: Often irrelevant, otherwise critically important, depending on whether patient-reported outcomes (PRO) are collected
  • Registries: Often critically important, as PRO is a far more common data source for registries

Let’s look briefly at each of these four features in turn.

Automatic reporting and notification

Registries may be observational, but make no mistake: there’s still plenty to do, especially when it comes to ensuring the internal and external validity of the study design. As with RCTs, registries begin that task before the first participant is ever enrolled. Inclusion and exclusion criteria define the patient population from which the study will draw. Enrollment targets and duration parameters are set to deliver the necessary statistical power. Data elements are selected ahead of time, as are relevant outcomes.

But RCTs wield two defenses against bias that registries do not: highly specific eligibility criteria, and randomization itself. The first defense minimizes the role confounding factors can play, while the second helps ensure that the influence of confounders is balanced between comparison groups. Registries, on the other hand, because of their greater need to reflect the diversity of the real world, cast “a wider net” with their eligibility criteria. In doing so, the room for selection bias (and confounder impact) grows. And because oversampled patient types are not randomized to one or more groups in a registry, they can distort findings more powerfully.

The registry data manager, then, is often engaged in a constant battle against selection bias. She has no more powerful weapon than real-time reporting, which can signal when enrollment efforts need to be retargeted.


Typically, criteria for registry enrollment aren’t as selective as they are for RCTs. That kind of wiggle room leaves the door open for selection bias. Regular, visual reporting of subgroup counts (e.g. patients of a certain race, ethnicity, sex, age, or socioeconomic status) is indispensable to maintaining a registry population that is representative of the general population with the disease, exposure, or treatment under study.

That same real-time reporting, directed now at the site, can automatically prompt CRCs to contact participants in a longitudinal study at the right intervals. Why is this important? Missed visits mean missing data, which poses two risks. The first is a failure to collect enough overall data points to achieve the desired statistical power. The second, more subtle risk pertains to whom the missing data belongs. If a certain patient subgroup is disproportionately more likely to miss visits (and therefore leave blank spaces in the final dataset), results become biased toward the subgroups who were compliant with visit schedules.

Missing data is the scourge of registries. Without consistent outreach to all participants from sites, the data collected can easily be skewed by those participants who are proactive in keeping their appointments. Give your sites helpful, regular reminders of upcoming milestones for their participants.

The takeaway? Look for a data management system that allows you to build clear, actionable reports, and to push them out automatically to sites and other stakeholders on a schedule you set.


Interoperability

The life sciences are awash with data, and yet how little of it flows smoothly from tank to tank. My blood type, and yours, is very likely recorded in a database somewhere. Yet, if either of us participates in a study where that blood type is a variable, we are almost certainly looking at a new finger prick.

The situation is poor enough for RCTs, but becomes dire with registries. Registries that don’t easily consume extant secondary data place increased burden on site staff, who are rarely reimbursed well or at all for their contribution. RCTs, on the other hand, often pay per assessment. Also unlike RCTs, registries make more frequent use of this data:

While some data in a registry are collected directly for registry purposes (primary data collection), important information also can be transferred into the registry from existing databases. Examples include demographic information from a hospital admission, discharge, and transfer system; medication use from a pharmacy database; and disease and treatment information, such as details of the coronary anatomy and percutaneous coronary intervention from a catheterization laboratory information system, electronic medical record, or medical claims databases. – Gliklich RE, Dreyer NA, Leavy MB, editors. Registries for Evaluating Patient Outcomes: A User’s Guide [Internet]. 3rd edition. Rockville (MD): Agency for Healthcare Research and Quality (US); 2014 Apr. 6, Data Sources for Registries.

Clearly, the ability to exchange data among multiple sources in a programmatic way (i.e. interoperability) is a must-have for the EDC that will power your registry. Of course, unlike data storage capacity, you can’t quantify interoperability with just a number and a unit of measure. Interoperability is a technical trait that depends on more fundamental attributes:

  • Data standards – Does the system “speak” an open, globally recognized language, such as CDISC?
  • API services – Does the system offer clear, well-documented processes for accepting (and mapping) data that is pushed to it from external sources?
  • Security – Will data that enter, leave, and reside within the system remain encrypted at all times?

Before selecting an EDC, press your prospective vendors on the questions above. Then inquire exactly how they’ll ensure safe and reliable integration between their system and all your data sources.

Researcher ease-of-use

Contributing to clinical research is, for many, its own reward. The prospect of expanding our medical knowledge and, perhaps, improving patient lives, is a powerful incentive. But it’s easy for a clinician or researcher to lose sight of these ideals in the middle of a hectic workday. When the research is long and unpaid, which is more likely to be the case for a registry than an RCT, the will to “get the work done” can quickly trump the will to do it right.

Leaders of registry operations, therefore, have an even greater responsibility than their RCT peers to keep hurdles low. That’s a wide-ranging obligation, but ensuring a frustration-free data capture experience stands at or near its center.

First, a clinical research coordinator (CRC) should encounter no obstacles when signing in to the EDC and navigating to the right participant. These are the “low bars.” Even so, they can easily trip up thick-client systems, and even web-based systems that aren’t built for performance or designed with UX (user experience) principles always front of mind.

But the most important ease-of-use tests happen in the context of the electronic case report form (eCRF). Recall that a large portion of registry data comes from clinical encounters that occur in the delivery of standard care. Think pulse oximetry, or resting heart rate. Consequently, any eCRF that can’t be completed while in the exam room ought to have you raising an eyebrow. Accept nothing less than forms that render clearly in any browser, on any device (no matter how it’s held). But that’s not all. Fields on the form need to be “smart”: appearing only when they are relevant; capable of showing specific, real-time messages when the entered value is invalid; and hanging on to input even if an internet connection is lost. Finally, these fields should “remember” and calculate for the CRC, instantly pulling in patient data from visits ago to reference in the current form, and effortlessly turning a height and weight into a BMI.
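Two of those “smart” behaviors, calculation and skip logic, can be sketched as follows. The BMI formula is the standard kg/m²; the field and function names are illustrative, not drawn from any particular EDC:

```python
def bmi(height_cm: float, weight_kg: float) -> float:
    """Standard BMI: weight in kilograms over height in meters, squared."""
    meters = height_cm / 100
    return round(weight_kg / meters ** 2, 1)

def show_pregnancy_field(sex: str, age: int) -> bool:
    """Skip logic: render the field only when it could be relevant."""
    return sex == "female" and 12 <= age <= 55

bmi(175, 70)                      # 22.9
show_pregnancy_field("male", 40)  # False -- the field never appears
```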


Can’t pull medical history from the EHR? Help your CRC out with fast and responsive autocomplete fields.

In short, contributing to your registry should go hand in hand with delivering excellent patient care and keeping accurate, up-to-date records. The further those drift apart, the more your registry suffers.

Participant ease-of-use

What endpoints are to RCTs, outcomes are to registries. And where there’s a concern with outcomes, there is (often) a concern with patient self-reports. Ergo, chances are high that your next registry will rely on patient-reported outcomes (PRO) as one of its data sources.

If we need to keep the barriers to data submission low for researchers, we need to keep them all but invisible to participants, while ensuring data quality. The simple paper form may appear to offer this balance. Historically, it may have done just that. But twenty years of Internet use have changed our expectations when it comes to offering personal information. Without sacrificing one bit (or byte) of security, we want the same ease in reporting aches to a physician as we find in booking a flight. We want instant “help” when we don’t understand a question, and we don’t want to be asked about matters that don’t apply to us.

Given the expectations above, a study that utilizes even a single PRO instrument can benefit from making the conversion to ePRO. Real-time edit checks, for example, re-orient the participant when their input conflicts with field requirements, without risking the influence of a human interpreter. The time and cost of transcription disappear.

When PRO takes the form of a patient diary, paper’s dirty secrets truly come into the light. Provided the paper form isn’t lost or damaged in the first place, it’s virtually impossible to tell whether a patient made daily diary entries as instructed, or retrospectively wrote responses just prior to a study visit, raising data quality concerns.

As a field, we’ve embraced ePRO for the last decade. But too many ePRO solutions don’t offer the ease or convenience they should. Many depend on provisioned devices, difficult to use and prone to malfunction. Web-based ePRO technologies are a step in the right direction. Here, too, though, industry efforts to deliver an effortless experience often fall short. Special software (such as smartphone apps) requires storage space, not to mention the know-how and patience for download, installation, and activation. Along with everything else participants need to remember, is it really fair, or feasible, to add a password, browser recommendations, and “virtual check-in times” to the list?

Won’t be getting you your data anytime soon

The answer lies in allowing patients to use their own devices, be it a laptop or smartphone, and to submit their data on the browser with which they’re most comfortable. Form URLs specially encoded for each participant make passwords unnecessary, while auto-scheduled email and SMS messages provide a friendly, “just-in-time” reminder to make their report. And what better way to convey a message of collaboration with the participant than eConsent? While its role in risky, interventional trials may still be unclear, eConsent is tailor made for registries: it can deliver an interactive education on the purpose of the study, ensure comprehension with in-form quizzes, and signal to registry leaders real-time recruitment trends.
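One plausible way to implement those participant-specific form URLs is an HMAC-signed token, sketched below. The secret, base URL, and participant IDs are placeholders, not a description of any vendor’s scheme:

```python
import hashlib
import hmac

SECRET = b"replace-with-a-study-specific-secret"  # placeholder key

def form_url(participant_id: str, base: str = "https://forms.example.org/pro") -> str:
    """Encode the participant's identity into the link itself -- no password needed."""
    token = hmac.new(SECRET, participant_id.encode(), hashlib.sha256).hexdigest()[:16]
    return f"{base}?pid={participant_id}&token={token}"

def verify(participant_id: str, token: str) -> bool:
    """Recompute the token server-side and compare in constant time."""
    expected = hmac.new(SECRET, participant_id.encode(), hashlib.sha256).hexdigest()[:16]
    return hmac.compare_digest(expected, token)
```

Because the token is derived from the participant ID and a server-side secret, a link forwarded to the wrong person still identifies only the participant it was issued to, and a tampered ID fails verification.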

As for ePRO data collection itself, layout, question order, and response mechanism can all make the difference between valid, timely data and no data at all. The participant isn’t an amateur researcher, and won’t tolerate the kinds of screens all of us envision when we think of EMRs. Data collection should proceed from the simple to the complex, leveraging skip logic to trigger only those questions that are relevant, and using autocomplete to help with terminology. A single column layout, a conspicuous progress bar and page advance button, autosave–all of these features are crucial to treating patients like the study VIPs that they are.