CDS for Non-Data Scientists Part 2: Resources for Bridging the Gap

59,000,000,000,000,000,000,000 bytes. That’s 59 zettabytes or 59 sextillion pieces of discrete information. Don’t feel bad; I had to look up those words too. That is the total amount of data estimated to have been generated through 11:59 pm on December 31, 2020, according to the International Data Corporation, since the start of the digital age.

Quantity | Magnitude | SI Prefix
Data on Planet Earth | 10^21 | Zetta-
Rough Estimate of the Number of Sand Grains on Earth | 10^18 | Exa-
Current Estimate of the Number of Stars in the Universe | 10^21 | Zetta-
Estimate of the Drops of Water on Earth | 10^24 | Yotta-

Assume for a moment that you were able to magically download all that data onto a single computer. It would take 3,277,777,778 of the largest currently commercially available hard drives (18 TB), each costing approximately $1,000 at this posting, for a grand total of well over 3 trillion dollars. Again, let’s say that, through science or magic, each HDD was the thickness of a US dollar bill; they would form a stack roughly 223 miles thick. One more figure because I am having WAY too much fun [OpenClinica asked a data nerd to write about data, so they should have known what they wrought]. In roughly a century (give or take a decade), humanity is expected to have generated more bytes of data than there are atoms on Earth.
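For anyone who wants to check the arithmetic, here is a minimal Python sketch of it. The 59 ZB total, the 18 TB drive size, and the $1,000 price come from the paragraph above; the bill thickness of 0.0043 inches is my own assumption (the commonly cited figure for US currency), so the stack height lands near the number quoted rather than exactly on it.

```python
# Back-of-the-envelope check of the storage figures quoted above.
TOTAL_BYTES = 59 * 10**21      # 59 zettabytes (from the text)
DRIVE_BYTES = 18 * 10**12      # one 18 TB hard drive (from the text)
DRIVE_COST_USD = 1_000         # approximate price per drive (from the text)
BILL_THICKNESS_IN = 0.0043     # ASSUMPTION: thickness of a US bill, in inches
INCHES_PER_MILE = 63_360

# Ceiling division: a partially filled drive still has to be bought.
drives_needed = -(-TOTAL_BYTES // DRIVE_BYTES)
total_cost_usd = drives_needed * DRIVE_COST_USD
stack_miles = drives_needed * BILL_THICKNESS_IN / INCHES_PER_MILE

print(f"Drives needed: {drives_needed:,}")
print(f"Total cost:    ${total_cost_usd:,}")
print(f"Stack height:  {stack_miles:,.0f} miles")
```

Integer arithmetic is used for the drive count so that numbers this large do not pick up floating-point rounding along the way.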
[Image: fake NASA government document]

In a recently leaked document, NASA has expressed interest in converting Mars into the largest data center in the solar system.

With all this data, it’s hardly a surprise that data science is one of the “it” careers and that more and more career paths need to be data fluent.

For several years now, I’ve held a top-down view of data science that might make me a lot less popular: learn data science, and I mean really learn it, and clinical data science will be a snap [you are mostly applying principles you already know; after all, to a computer, an int data type is pretty much an int whether it’s a count of the number of TVs sold at a store or a heart rate]. The inverse is rarely true. I once had a colleague, a statistician (as in a real, Ph.D.-holding, more-Greek-symbols-than-English-on-his-whiteboard type of statistician), who convinced me of a similar rationale with regard to statistics vs. biostatistics.

What makes a good data scientist?

If you are reading this blog, especially at 10:30 at night or while you munch away on lunch, you probably already have some of the qualities of a data scientist! Do you have an innate, some might argue pathological, desire to understand how things work? Are you infamous for your attention to detail? Obviously, success in data science is a bit more complicated and nuanced. Still, at the heart of it, all data science is driven by a desire to use information to improve decision-making. No knee-jerk decisions or gut feelings are allowed here. Those devils and angels on your shoulders can stay home too. So if you fit into these criteria, read on because you may have just found your calling. If you don’t, read on anyway.

How do I break into the data science field?

This is a path on which I have taken perhaps only the second or third step, so please, please don’t treat this as an exhaustive, stay-inside-the-lines type of article. Your path is yours alone. I can only offer some guidance and helpful hints that I’ve found along the way.

The first question you really need to grapple with is how much you want to get into data science. That isn’t meant to be derisive or anything of the sort. You’ll learn, if you haven’t already, that virtually everything in life has a cost. Is that super-specialized Ph.D. program worth the 5-7 years of work and time away from the workforce, not to mention the late nights staring at the computer screen? Maybe it is, and if so, go for it, but you still need to ask yourself if it’s worth it to you.

In the broadest sense, entering or advancing in the data science field takes the form of formal vs. informal training. Traditionally, formal training takes the form of an advanced degree of some type, while the informal is far more self-driven and created by you. Although neither is inherently better than the other, they both have some positives and some drawbacks, as you’ll soon see.

Back to school: is formal higher education suitable for you?

Theoretically, you can still enter the data science field with a bachelor’s degree and a strong math, science, and programming background; however, those days seem to be numbered. As more and more institutions have started offering advanced degrees in data science, the expectation that a serious candidate should have post-bachelor’s training has increased. Therefore, if you decide to invest the time, money, and effort into a graduate degree, you should know a few things first.

First, before you set foot in a data science class at a major university, you are usually expected to have completed:

  • three semesters of calculus,
  • one semester of linear algebra,
  • freshman computer programming,
  • and possibly differential equations and/or upper level statistics.

Not only that but you’re expected to remember them. I know, I was shocked too. So if you don’t have those classes on your transcript or you don’t recall how to calculate the multiplicative inverse of a matrix, you should probably brush up on those topics. More on that later.

This is my second attempt at a graduate program, so some advice for those considering a research-based degree. First, you should plan on spending roughly 4 hours every week studying and preparing for every 1 hour of coursework on your schedule. Sometimes this can be more, rarely less, depending on the specific course and your background. Then if you have teaching duties, plan on another 15 or so hours per week teaching and grading. Oh, then you have research and writing you’re expected to do, so budget another 15 hours for that. Oh, I almost forgot lab group meetings and any administrative responsibilities you might have. To train graduate students how to run a lab, faculty often delegate to their students. Finally, as you continue on in any program, you will usually have mentoring and leadership roles to take up your few remaining hours of freedom. Can you somehow sleep a negative number of hours??

A word to all the “Gentleman’s C” students out there: know that you are generally expected to maintain a 3.0 GPA in any graduate program.

Now that I’ve gotten most of the bad stuff out of the way, on to the good. If you like those kinds of environments and have a deep desire to learn, you may well have the time of your life. You’ll often be working on the cutting edge of science and technology and working with people who literally wrote the book on their subject. There is also an increasing number of online degree programs and ones designed specifically for working professionals.

Notice that a couple of paragraphs earlier I mentioned research-based degrees. Usually, you can tell a degree is research-based by its letters: M.S. and Ph.D. On the other hand, professional degrees are more likely to be MPH, MDS, or any number of other acronyms. Professional degrees will usually be more practically based and, perhaps more importantly to you, may require less in the way of prerequisites. There is always a tradeoff, though: professional degrees typically require work experience, usually 3-5 years. For instance, the degree I’m currently pursuing (a professional degree) doesn’t require any specific courses on your transcript, BUT you are still held accountable for those skills and knowledge as if you had just taken the prerequisites I listed earlier.

A few words about online graduate degrees

Yes, there are a few high-quality online degrees out there, but it is often buyer-beware. Some things you want to look for are:

  1. Are they attached to a traditional, physical university? And if so, do they offer an approximate equivalent “on-campus” program? This is always a plus because such universities must maintain a certain standard or risk losing their accreditation. There is also a greater likelihood that you will be dealing with quality faculty.
  2. Will your degree or transcript say “Online” on it? Honestly, this is a very subjective issue. For me, concern 1 is far more critical than this one. Still, if you’re going to put in 2-5 years of work on a master’s degree, you don’t want to risk coming out the other side with it being perceived as worth less than a “traditional” program.

Eh…got anything else???


One step below in intensity from an official degree are boot camps, webinars, certificates, and professional development programs. These can be very practical options, allowing you to demonstrate your talent with a type of real-life legitimacy that formal degrees don’t inherently impart.

Lastly, just taking time to read and learn about a given subject can be hugely beneficial. Perhaps you don’t need to know every nook and cranny of data science all at once. Maybe you can start by asking a single question or picking a specific topic. For example, “How do I use SQL to program a database?” or “What is this ‘GitHub’ thing I keep seeing in my Google results?” Then you can expand your knowledge from there.

If you consider that the true pioneers of any field had to discover it and make it up as they went, there’s no reason why we couldn’t learn the same way.

Some (but far from all) US-Specific Programs

Traditional Degrees (MS, PhD, or both)

  • University of Arkansas Medical School [on-campus certificate, master’s & Ph.D. in biomedical informatics, including a specialization specifically in clinical research informatics]
  • University of Colorado Boulder [on-campus MS-DS]
  • Indiana University Bloomington [several on-campus master’s & Ph.D. programs in bioinformatics and data science]

Online Degree Programs

  • University of Colorado Boulder [online MS-DS]
  • Indiana University Bloomington [online MS in Data Science]
  • University of New Hampshire [online MS in Health Data Science]

Other Educational Opportunities

  • SpringBoard [online data science bootcamp]
  • University of Colorado Clinical Data Science Specialization [Certificate on Coursera]

Some Resources that I like

Addendum: My Data Science Pathway (a personal story)

So, now that we’ve covered all that, I’d like to give you a peek into my journey so far to becoming a data scientist. Keep in mind, we could be polar opposites for all I know, so you may favor a different path, and I certainly have no patent on this method, so feel free to rip it off 100%.

First, it’s been a while since I had any type of formal schooling. I left my original graduate program in 2012 and haven’t had much exposure since then. So I knew I needed to brush up on my background knowledge. Running the occasional ANOVA test since then hardly demonstrates everything you need to know. To refresh my math & statistics chops, I chose an online learning website. Though there are plenty of similar websites, I picked the one that matches my learning style and budget. I generally try to do a lesson (they call them quizzes) per day.

I knew I also needed to improve my programming skills within data science. I feel pretty comfortable with C++, C#, and Java, for instance. Python, not so much. It’s just a question of exposure to the language for me, so I picked a site that deals specifically with data science applications, which makes it a dependable resource. It admittedly costs a little more than I wanted to spend, but I found a nice 3-month free subscription. Check out the GitHub Student Developer Pack, for which all I needed was a valid school email address.

One last preparatory tip I’ll offer is to make a cheat sheet. No, not to actually cheat. Virtually every language and package (you can commonly have 10 or more active at any time in real-life applications) differs slightly from the others, and I’ve found it really is impossible to keep them all straight at once. Case in point: perhaps you’re working with 2 different SQL databases at once, one programmed in PostgreSQL, the other in T-SQL. There is just enough difference between those two to cause you headaches, so maintaining a 3-4 page quick reference can be a life-saver. I’ve found ready-made cheat sheets to be a solid starting point, but mine usually end up covered in post-it notes all the same.
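As one illustration of the kind of entry such a cheat sheet might hold, here is a minimal Python sketch. The table name `visits`, the helper `top_rows`, and the dictionary layout are all hypothetical; the one dialect difference it records is a well-known one: PostgreSQL limits rows with `LIMIT n` at the end of a query, while T-SQL uses `SELECT TOP (n)` at the start.

```python
# A tiny "cheat sheet" as code: the same request phrased for two SQL dialects.
# Table name, function name, and dict layout are hypothetical illustrations.
ROW_LIMIT_TEMPLATES = {
    # PostgreSQL puts the row limit at the end of the query.
    "postgresql": "SELECT * FROM {table} ORDER BY {order_by} LIMIT {n};",
    # T-SQL (SQL Server) puts it right after SELECT.
    "tsql": "SELECT TOP ({n}) * FROM {table} ORDER BY {order_by};",
}


def top_rows(dialect: str, table: str, order_by: str, n: int) -> str:
    """Return a 'first n rows' query written for the given dialect."""
    try:
        template = ROW_LIMIT_TEMPLATES[dialect]
    except KeyError:
        raise ValueError(f"No cheat-sheet entry for dialect: {dialect!r}")
    return template.format(table=table, order_by=order_by, n=int(n))


for dialect in ROW_LIMIT_TEMPLATES:
    print(f"{dialect:>10}: {top_rows(dialect, 'visits', 'visit_date', 5)}")
```

This is meant as a quick reference only. Building real queries by pasting names into strings is unsafe with untrusted input, so anything beyond a personal lookup table should rely on the database driver’s own parameter handling.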

I made good use of Kindle and YouTube for virtually every other topic to make sure I was solid. This leads me to another pearl of wisdom about graduate school: you can’t be too prepared. Transitioning from an undergraduate program (or what you remember of it) to a graduate program is like suddenly being drafted into a professional sport. The level at which you are expected to perform is high, and the learning curve is steep.

For my actual master’s program, I chose one that launched only recently. The University of Colorado Boulder has an online version of its on-campus master’s degree that is a solid fit for what I want. I have also found I do better with shorter courses, and this program has academic terms of 2 months. To a good approximation, they take each 3-hour class in the on-campus degree and chop it into 3 pieces. It is also 100% asynchronous, so as long as the coursework is completed by the deadline, I can watch the lectures and do the work whenever I can fit them in.

Clinical Data Science for Non-Data Scientists (Part 1)

Einstein is often credited with saying, “If I had an hour to save the world, I would spend 55 minutes defining the problem.” In that spirit, we begin by defining what we mean by clinical data science, at least for the purposes of this blog series. CDS is the application of data science to medical and therapeutic decision-making, including research applications. Simple, right? So far, so good?

Well, now comes the hard part: defining “data science” itself, as even seasoned professionals can be inconsistent with terminology. Is a given example best characterized as data science, or data analytics, or perhaps business intelligence? Do you need a data visualization expert? Perhaps we should just give everything to a statistician and hold our breath? These distinctions are more than purely academic, as finding the right person with the right skillset can seriously impact your outcome.

Before going any further, it is worth noting that for historical reasons, as much as anything, many of these job titles overlap, and many who perform them could have almost identical skillsets. The definitions and distinctions offered here may be one of many frameworks possible, but they seem to represent a plurality and perhaps the early signs of a convergent system. To help unpack these concepts, let’s examine them through the lens of when a given question would likely be addressed.

  • A statistician typically looks at data through a hindsight lens. They will generally try to answer questions about what happened and frame it in relation to attempting to pinpoint a true value.
  • A data visualization expert creates dashboards and similar tools to help decision-makers interpret data more quickly and easily.
  • A business intelligence expert will try to pull information from multiple sources, combine it with a knowledge of business operations, and make decisions largely based on the business’s own interests.
  • A data analyst will typically analyze past data, often with some statistical inference techniques, to make predictions and decisions about future events.
  • A data scientist uses real-time, or near real-time, data to make decisions through largely algorithmic techniques.

As you can see, the difference between the definitions of data analyst and data scientist is subtle, but the key to understanding it is that a data scientist’s approach will tend to adapt much faster than a data analyst’s. In essence, the data analyst may be examining trends that are a month, a quarter, a year, or more old. This is not to say that such analyses are not useful; one of the key business metrics is a year-over-year performance analysis, which is exclusively in the purview of a posteriori analysis.

The advent and evolution of clinical data science as a discipline opens some exciting possibilities for clinical trials. A greater emphasis on adaptive trials and real-world evidence can increase the speed of trials and the validity of results. In fact, the data show that in some disease areas, especially those with a strong genetic component (such as cancers or autoimmune diseases), an adaptive clinical trial can often decrease time, costs, and risks to subjects and sponsors alike (Pallmann, Philip. Adaptive Designs in Clinical Trials: Why Use Them, and How to Run and Report Them. 2018). Umbrella, basket, and platform trial designs have all become more common in the past decade as:

  1. knowledge of molecular genetics has increased,
  2. the cost and difficulty of molecular techniques have decreased,
  3. and the complexity of interim analysis has decreased.

A knowledgeable clinical data scientist could even conceivably program a majority of interim analyses to run repeatedly (using an appropriate correction for multiplicity issues), which would decrease the downtime that can hinder adaptive designs. This approach could also end unproductive or unnecessary trials earlier than a traditional approach would.
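To make the multiplicity point a little more concrete, here is a minimal Python sketch of repeated interim looks checked against a corrected threshold. It assumes a plain Bonferroni correction (overall alpha divided by the number of planned looks), chosen only because it is the simplest to show; real group-sequential designs more commonly use boundaries such as O’Brien-Fleming or Pocock. The function names and the p-values are hypothetical.

```python
def bonferroni_threshold(overall_alpha: float, planned_looks: int) -> float:
    """Per-look significance threshold under a Bonferroni correction."""
    if planned_looks < 1:
        raise ValueError("planned_looks must be at least 1")
    return overall_alpha / planned_looks


def first_stopping_look(p_values, overall_alpha=0.05, planned_looks=None):
    """Return the 1-based look at which the threshold is first crossed, or None."""
    looks = planned_looks if planned_looks is not None else len(p_values)
    threshold = bonferroni_threshold(overall_alpha, looks)
    for look, p_value in enumerate(p_values, start=1):
        if p_value < threshold:
            return look
    return None


# Hypothetical p-values from four pre-planned interim looks.
interim_p_values = [0.20, 0.04, 0.009, 0.001]
print("Per-look threshold:", bonferroni_threshold(0.05, 4))
print("First look crossing it:", first_stopping_look(interim_p_values))
```

Note that the second look (p = 0.04) would have passed an uncorrected 0.05 cutoff but does not pass the corrected one, which is exactly the false alarm the correction is meant to guard against.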

Almost all parties at all levels in clinical trials can benefit from the use of data science. Industry sponsors could see the most direct benefits in both cost and time reduction. Traditionally a large pharmaceutical company may value data science mostly to achieve time savings. In contrast, smaller start-ups and midsize companies might see value primarily on the side of cost savings. A sponsor can reduce the resources required to bring a product to market by applying a more adaptive design to trials that can support a greater frequency of interim analyses without the traditional overhead of scheduling a database freeze, completely resolving queries, and having a small army of statisticians and programmers spend weeks only to then have to present the data to a DSMB or similar body.

A clinical data scientist could theoretically preprogram all the necessary analyses and utilize any number of machine learning techniques to mitigate many unforeseeable circumstances, such as missing or erroneous data, outliers, or noncompliance. Machine learning, coupled with Natural Language Processing and search engine spiders, a.k.a. web crawlers, could conceivably enable sites and sponsors to monitor web forums, messages within EHR systems, and many more systems for SAEs or underreported adverse events. Similarly, a clinical data scientist could use emerging technologies to gather and process data to a degree that even 5 years ago would have seemed impossible. While technology continues to evolve at a substantial pace, it seems likely that humans will always be part of the process; therefore, it is unlikely that overhead can ever be reduced to zero, but data science can greatly reduce it.

CROs and SMOs tend to take a more business intelligence approach to clinical data science. It makes sense that if site A sees 5x more lung cancer patients than site B, then it should reasonably follow, all things being equal, that site A should get the phase III trial for a novel targeted lung therapy. This is an oversimplified example, but it illustrates the point that data can and should inform business decisions wherever possible. If a CRO notices a sharp uptick in the number of queries for a given site, they might arrange for retraining at the next monitor visit to increase compliance.
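The query example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the site names and monthly query counts are made up, the function name is hypothetical, and the rule (flag a site when its latest month sits more than two standard deviations above its own history) is one simple choice among many, not a method prescribed by any CRO.

```python
from statistics import mean, stdev


def flag_query_uptick(monthly_queries, z_cutoff=2.0):
    """Flag the latest month if it sits well above the site's own history."""
    *history, latest = monthly_queries
    if len(history) < 2:
        raise ValueError("need at least two prior months to estimate spread")
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        # A perfectly flat history: any increase at all counts as an uptick.
        return latest > baseline
    return (latest - baseline) / spread > z_cutoff


# Hypothetical monthly query counts per site, oldest month first.
site_queries = {
    "Site A": [12, 15, 11, 14, 13, 31],
    "Site B": [20, 22, 19, 21, 23, 22],
}
for site, counts in site_queries.items():
    status = "consider retraining" if flag_query_uptick(counts) else "no action"
    print(f"{site}: {status}")
```

Comparing each site against its own baseline, rather than against other sites, keeps a busy but stable site from being flagged simply for having more data.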

Regulatory agencies and quality assurance departments can use data science to target risk-based monitoring programs and distribute routine audits more efficiently, which in turn makes those programs more effective. The US FDA’s BIMO (Bioresearch Monitoring) program already uses some algorithmic approaches to determine where to send inspectors. The next logical evolution would likely be to incorporate machine learning to make the algorithm more capable. Perhaps it would incorporate Natural Language Processing to spot trends in FDA 3500 forms (for safety reporting) or even in social media groups discussing research experiences.

Research sites, too, can utilize data science methods to increase their own efficiency. Let’s be frank: we’ve all at least heard of sites that will apply to join every study even remotely applicable to their practice. It’s understandable. In the purely academic realm, the term of art is “publish or perish,” and if you don’t stay in the black, then you won’t be a site for very long. This shotgun approach to research participation can have unintended consequences, however. Even something as simple as completing a study application form can take hours, contracts can take days, and study start-up can take weeks. All this time is, as they say, money. A data-driven approach has the potential to guard against this tendency.