NEXT TALK IS GIVEN BY ARI ERCOLE ON THE DAQCORD TOOL. >> I'M SPEAKING ON BEHALF OF A TEAM OF PEOPLE INVOLVED IN THIS ENTERPRISE, LARGER GROUP OF INDIVIDUALS AND COLLABORATORS, I'LL PRESENT DETAIL AT THE END. MAKING DATA CURATION SOUND INTERESTING IN THE MORNING, I WILL TRY. I THOUGHT PERHAPS A GOOD WAY IS PERHAPS TO SAY SOMETHING CONTROVERSIAL. THERE WILL BE SOME CONTROVERSY HERE. MOST CONTROVERSIAL THING I'M GOING TO SAY HERE IS THAT THE CDEs, WHILE ABSOLUTELY CRUCIAL AND NECESSARY TO ENSURE DATA QUALITY, ARE NOT SUFFICIENT, OKAY? IT'S A MORE COMPLEX TASK THAN THAT. DATA QUALITY UNDERPINS THE WHOLE ENTERPRISE OF NOT JUST THE SCIENCE BUT DATA SHARING AND DATA REUSE. I'M A CLINICIAN, A SCIENTIST. DATA SCIENTISTS ARE EXTREMELY COOL. IN 2016, I WAS WORKING, CAN SEE MY DIARY, ON THE 9th THOUGHT I WOULD LOOK AT DATA. AFTER THAT IT'S BLANK, WE REALIZED -- MAYBE I HAD A PROBLEM. ONE PROBLEM WAS THAT I TOLD MY FRIENDS THIS IS WHAT I DO. AND IN FACT WHAT I WAS GOING TO END UP DOING FOR 18 MONTHS WAS CLEANING DATA. [LAUGHTER] MORE SERIOUSLY, THIS IS A COMPLICATED DATASET, 2,600 VARIABLES AND TIME POINTS, THEY WOULD DERIVE, IT WAS DIFFICULT, BIOMARKERS, GENETICS, NOT PARTICULARLY DEALING WITH IMAGING BUT COMPLEX DATASETS. WE CAN DEAL WITH THAT. SHOULD HAVE BEEN STRAIGHTFORWARD BECAUSE IT'S DESIGNED PROPERLY, CENTER-TBI USED COMMON DATA ELEMENTS, LOTS OF WAYS TO STANDARDIZE DATA, IF YOU DON'T LIKE THAT THERE ARE OTHER THINGS, DESIGNED PROPERLY. THIS SHOULD BE STRAIGHTFORWARD. RAPIDLY REALIZED IT WASN'T. THINK ABOUT MISSING DATA. WE'D LIKE TO LOOK AT DATA COMING IN SO WE CAN TAKE APPROPRIATE REMEDIAL ACTION WHEN WE FIND THERE ARE SOME PROBLEMS. FIND OUT WHERE IT'S MISSING, IT'S THERE OR NOT.
THERE IT WASN'T EASY, SADLY, NOT EASY BECAUSE THE WAY WE RECRUIT PATIENTS AND FOLLOW THEM UP OVER TIME REQUIRES SKIP LOGIC, NOT ALL DATA SHOULD BE PRESENT AT ALL TIMES FOR ALL PATIENTS. UNLESS YOU CODIFY THAT, YOU CAN'T KNOW WHICH PATIENTS SHOULD OR SHOULD NOT HAVE DATA ELEMENTS AND CAN'T TELL WHAT'S MISSING. WORSE, THINKING ABOUT WHY SOME OF THIS IS MISSING, THERE WERE NO PATIENTS IN THE GOSE 1. THAT SOUNDS LIKE A PROBLEM BUT THESE ARE PATIENTS WHO HAVE DIED AND PEOPLE DON'T COME TO FOLLOW-UP CLINICS, BECAUSE OF THE WAY SEQUENTIALLY WE RECORDED DATA THIS NEVER GOT PUT INTO THE DATABASE, IT'S NOT A PROBLEM, JUST NEVER GOT -- WE CAN POST-POPULATE, WE KNOW WHO DIED, DEATH OCCURS IN MANY, MANY DIFFERENT SITUATIONS, MANY DATA ELEMENTS WITHIN THIS COMPLEX DATASET. IF YOU WANT TO FIND OUT, THIS IS THE STRUCTURED QUERY LANGUAGE THAT YOU HAVE TO DO WITHIN THE DATABASE. IT'S NOT EASY. OKAY. AND ON THAT NOTE, THIS SITS IN A COMPLEX DATABASE, ANYTIME YOU WANT TO QUERY THAT OR GET DATA OUT YOU HAVE TO START WRITING COMPUTER CODE. I'LL COME BACK TO THAT AT THE END. WE CAN LOOK ACROSS SITES. THIS IS CALCIUM. WE CAN SEE IN FACT CALCIUM VALUES VARY ACROSS SITES, SOME PEOPLE USE IONIZED, SOME NOT IONIZED, THIS IS CALCIUM BUT WE HAVE A BIGGER PROBLEM WITHIN THAT COMMON DATA ELEMENT. OTHER THINGS, THESE ARE SOME MEASUREMENTS WITHIN ICU, TWO IN PHYSIOLOGY, DIFFERENT TIMES, 2:00, 4:00, 6:00, SO FORTH, THERE'S A DISCREPANCY BETWEEN WHAT WAS INTENDED AND INTERPRETED. THIS IS NOT A PROBLEM OF KNOWING WHAT ENTITIES MEAN. THIS IS A PROBLEM OF THE ANNOTATION DESCRIPTION OF THE ENTITIES. HERE IS AN EXTRACT STRAIGHT FROM THE CDE WEBSITE, PLACES OF OCCURRENCE OF INJURY, OTHER, FREE ENTRY TEXT FORM. LET'S LOOK AT SOME OF THAT. THIS IS FREE ENTRY FROM INJURY CAUSE WITHIN CENTER-TBI, THERE'S GOOD STUFF, HIT HEAD, FALL, BUT HOW AM I GOING TO DEAL WITH THIS INFORMATION?
HERE I'VE GOT AN EXTRACT, 6-METER HIGH AUDIO DEVICE FELL ON PATIENT, FALL FROM TREE, ICE HOCKEY, LLAMA, BULL, COW, HORSES, SKIING, MARTIAL ARTS, PLANE CRASH, ANOTHER PROBLEM, THIS IS MISSPELLED AS WELL. I'VE GOT RUN OVER BY A CULT, I DON'T KNOW IF THAT'S A SPELLING MISTAKE. THIS IS INFORMATION BUT I HAVE NO IDEA WHAT I'M GOING TO DO. WE COULD USE MAGIC WORDS LIKE NATURAL LANGUAGE BUT YOU'RE NEVER GOING TO FIX THIS. THIS IS UNNATURAL LANGUAGE PROCESSING. THE MORE WE LOOKED, THE MORE WE REALIZED IT'S COMPLEX, NOT JUST MISSING DATA. NOT THAT WE DON'T UNDERSTAND DATA. CDEs DO A GREAT JOB BUT IT'S OTHER BITS. WE LOOK INTO VARIABLE INCONSISTENCY, PAIRS OF VARIABLES, A LOT OF POTENTIAL PAIRS OF VARIABLES WHEN YOU HAVE OVER 2,000 VARIABLES, REALLY DIFFICULT, OKAY? SO BY WAY OF BACKGROUND. WHAT WE REALIZED WAS AFTER 18 MONTHS OF THIS PAIN, WHAT WE HAD SEEN CANNOT BE UNSEEN. AND NOW THAT I KNOW HOW COMPLEX THINGS ARE AND WE KNOW HOW COMPLEX THINGS ARE I WILL NEVER PERSONALLY TRUST ANY STUDY THAT HAS NOT HAD A PROPER ATTENTION TO DETAIL TO DATA QUALITY. THESE LARGE STUDIES, OKAY? WE HAVE. JUST TO TAKE THAT ONE OFF. IT'S REALLY IMPORTANT, NOT JUST ABOUT MEASURING. THE QUESTION THEN IS HOW WOULD ONE GO ABOUT ASSESSING ATTENTION, TEACHING PEOPLE TO PAY THAT ATTENTION TO DETAIL BECAUSE WE DIDN'T KNOW THIS. WELL, ONE STRATEGY WITHIN THE LITERATURE, WITHIN REPORTING GUIDELINES, THERE ARE PLENTY OF THESE, DO WE HAVE ONE REALLY FOR OBSERVATION, WE DO, STROBE IS THE CLOSEST, REALLY THIS IS MUCH MORE ABOUT STATISTICAL METHODS AND DESCRIPTION OF PARTICIPANTS, DATA AND SO FORTH, NOT REALLY ABOUT HOW ONE GOES ABOUT GUARANTEEING DECENT DATA QUALITY, PREFERABLY ENSURING THAT PROSPECTIVELY, I NEVER WANT TO DO THIS RETROSPECTIVELY AGAIN. WHAT EMERGED? WHAT LED TO DAQCORD? WE WOULD TRY TO PROVIDE CRITERIA, IT WOULD BE POSSIBLE TO DEMONSTRATE YOUR STUDY DESIGN WAS LIKELY TO HAVE GOOD QUALITY, MAYBE LIKE STROBE, PRISMA, CONSORT.
ONE WOULD LIKE THIS TOOL TO BE ABLE TO DEMONSTRATE THAT STUDY HAS BEEN DESIGNED WELL TO POTENTIAL FUNDERS, BUT ALSO TO DEMONSTRATE THAT THIS WAS A CREDIBLE STUDY AND ATTENTION TO DETAIL HAD BEEN PUT IN BECAUSE THIS HAS BEEN A HUGE EFFORT. I HAVEN'T LISTED LARGE TEAM OF PEOPLE INVOLVED IN DATA CURATION WITHIN CENTER, SIMILARLY TRACK, ENORMOUS EFFORT. ANY STUDY THAT DOES NOT HAVE ATTENTION TO DETAIL IN CURATION IS NOT CREDIBLE, I WOULD ARGUE. WE WANT TO CREATE A TOOL, ASSIST DESIGN QUALITY FROM THE START. ALSO TO IMPROVE DATA SHARING THROUGH DESIGN DOCUMENTATION. THIS WAS TO REMAIN GENERIC, NOT TIED WITH NEUROSCIENCE, COULD BE SPREAD TO EVERYTHING. THIS IS NOT A CHECKLIST, NOT AIMING FOR ANOTHER CHECKLIST. SO, DAQCORD WAS BORN. WE USED MODIFIED DELPHI APPROACH TO IDENTIFY KEY INDICATORS, INVOLVED STAKEHOLDERS, A LIST FROM LARGE OBSERVATIONAL STUDIES, BIGGEST STUDIES TACKLED THIS PROBLEM. WE HAD THREE ROUNDS, ONLINE, FACE TO FACE THAT TOOK PLACE AT NIH 2018, AND FINAL ROUND REGISTERED WITH EQUATOR NETWORK, YOU CAN READ ABOUT IT. WE ASSESSED CONCEPTS ON THREE DOMAINS, WHETHER OR NOT THE CONCEPT WAS VALID, IT WAS RELATED TO QUALITY OR NOT, WHETHER FEASIBLE, IN OTHER WORDS COULD YOU MEASURE OR QUANTIFY IT, AND WHETHER YOU COULD DO ANYTHING ABOUT IT POTENTIALLY AT THE END OF IT. THE METHODOLOGY HAS BEEN USED BEFORE, INCLUDING CENTER-TBI, FIVE-POINT LIKERT SCALE, ALLOWS US TO ASSESS THE RATING BUT ALSO INTERRATER AGREEMENT, DEPENDING WHETHER SOMETHING WAS RATED, HIGH OR POOR QUALITY, AND WHETHER GOOD AGREEMENT WE COULD TAKE THINGS FORWARD OR NOT BETWEEN ROUNDS. WE DID DO FREE TEXT, TOOK COMMENTS, SYNTHESIZED INTO EXAMPLES. WE REALIZED THIS IS A VERY HETEROGENEOUS PROBLEM. THERE'S NOT ONE SOLUTION TO ANY ONE CONCEPT. SO WHAT WE WANT TO DO IS PROVIDE ACTUALLY EXAMPLES. HERE IS AN ELEMENT, A PARAMETER, SOMETHING YOU SHOULD THINK ABOUT. HERE ARE EXAMPLES OF HOW ONE MIGHT, MAY NOT BE APPROPRIATE, BUT ONE MIGHT GO ABOUT SHOWING THAT YOU'VE ADDRESSED THAT PARAMETER.
YOU CAN WRITE CODE AND AUTOMATE ANALYSES, THE BACK END ACTUALLY PLATFORM FOR SIMILAR DATA, STARTED LITERATURE REVIEW, 106 POTENTIAL ITEMS THAT WE WOULD START WITH, WENT THROUGH OUR THREE ROUNDS, AND WE ENDED UP AT THE END WITH A FINAL DAQCORD, 46 ITEMS, A LOT, AND I'LL COME BACK, BUT 46 FINAL ITEMS. I'M NOT GOING TO GO THROUGH IN DETAIL, YOU'LL BE PLEASED TO KNOW. WE'RE WRITING THE PAPER UP AT THE MOMENT. NOT ALL THESE WILL BE RELEVANT FOR EVERY STUDY. YOU PICK AND CHOOSE. SKIP OVER THAT. WE'LL BE REPORTING, YOU'LL BE ABLE TO READ WHAT THE ITEMS LOOK LIKE, WE HAVE CREATED A LIVING DOCUMENT, GUIDANCE, WE WOULD LIKE TO BUILD, MAINTAIN ONLINE TOOL, BUT THAT REQUIRES BUILD AND MORE MAINTENANCE IN THE FUTURE SO LET'S THINK ABOUT HOW WE MIGHT DO THAT. JUST TO FINISH OFF, 46 ITEMS, THAT'S A LOT. AGAIN LET ME STRESS THIS IS NOT A CHECKLIST. PEOPLE THAT KNOW ME KNOW MY HATRED OF CHECKLISTS. I WORK ON THE AIR AMBULANCE. MY COLLEAGUES PRODUCED THIS MUG FOR ME. THIS IS NOT A CHECKLIST. WE TRY TO PRODUCE SOMETHING COMPREHENSIVE AND EXPANSIVE, ACTUALLY GUIDANCE TO PEOPLE AT THE OUTSET RATHER THAN JUST SOME CHECK. THIS IS NOT AN AREA THAT LENDS ITSELF TO TICKING BOXES. NOT ALL OF THESE ELEMENTS WILL BE RELEVANT TO ALL STUDIES. THAT DOESN'T MATTER. IT'S NOT AN IMAGING STUDY. THAT DOESN'T MATTER. WE CANNOT ATTEMPT TO ANTICIPATE ALL THE SOLUTIONS, WE WANT TO ENCOURAGE NARROW APPROACH TO PROSPECTIVELY LOOK AT DESIGNING. TO FINISH, REALLY TO THINK OF WIDER PICTURE, WHERE DOES THIS FIT IN? WE DO HAVE A PROBLEM. THIS IS FROM 2006. THINGS LIKE THIS MAKE ME LAUGH, NOT IN A GOOD WAY. WE ALL THINK REPRODUCIBLE SCIENCE IS IMPORTANT. I'M NOT ARGUING. CLEARLY WE THINK IT'S VERY IMPORTANT. THIS IS ONLY PART. THIS IS THE TIP OF THE ICEBERG. IT REALLY IS THE TIP OF THE ICEBERG. BECAUSE THE DATA THAT WE HAVE, RIGHT, WE MAY HAVE GOOD CDEs, BUT THERE ARE A LOT OF ASSUMPTIONS TAKEN FROM RAW DATA IN SOMETHING WE SHARE WITH PEOPLE, ANALYSES ARE NOT ROBUSTLY TESTED.
IF YOU DON'T BELIEVE THAT, YOU MIGHT HAVE SEEN THIS IN THE LAST FEW WEEKS, A RETRACTION IN JAMA, I SHOULD SAY AUTHORS BEHAVED IMPECCABLY HERE. WHAT HAPPENED, RANDOMIZED CLINICAL TRIAL, THEY RECODED. THIS IS THE CODE. THEY GOT ALIVE, ZERO OR 1, CODED TO SOMETHING INTERNALLY, 1 OR ZERO, NOW ALIVE-DEAD IS NOW DEAD-ALIVE AND YOU CHANGED THE ASSOCIATION. THIS HAPPENED BECAUSE THIS IS A COMPUTING -- WE HAVE OPERATING SYSTEMS, COMPLICATED COMPUTER PROGRAMS. THERE ARE WAYS YOU CAN AVOID THIS SORT OF PROBLEM. I WAS ASKED TO LOOK AT DATA, TOMOGRAPHY DATA, DOESN'T MATTER. I'M NOT EXPECTING YOU TO READ THIS BUT WE'VE GOT SOME READING DATA IN STUFF, AT THE TOP. DATA CLEANING, RETRANSFORMATION. THIS IS A COMPUTER PROGRAM. OUR ANALYSES ARE BECOMING MORE LIKE COMPUTER PROGRAMS. MAYBE WE SHOULD LEARN TECHNIQUES AND STRATEGIES LEARNED IN THE SOFTWARE INDUSTRY TO DOCUMENT. HERE'S ANOTHER ONE, MIMIC DATA, I WANT TO GET SOME CALCIUM, MIMIC DATASETS, 200-SOMETHING CALCIUM VALUES. WE HAVEN'T GOT A PROBLEM WITH THE DIFFERENT IONIZED, NON-IONIZED. IT'S QUITE COMPLICATED. DID I DO IT RIGHT? HOW DO I KNOW? HOW DO YOU KNOW I DID IT RIGHT? HOW DO YOU KNOW? IT'S NOT MODULAR. WE SHOULD LEARN FROM SOFTWARE DEVELOPMENT AND WE MAKE A LOT OF CHOICES AND GET ANSWERS, THAT'S WHY THE DAQCORD TOOLS, TESTING, THROUGH TO REPORTING, YOU KNOW, WE MAKE THESE CHOICES AND THESE LIMIT REPRODUCIBILITY. ANALYSIS IS A PROCESS. HOW DO YOU REMEMBER, KEEP TRACK AND DOCUMENT? WE SHOULD USE IT THINKING ABOUT SOFTWARE DEVELOPMENT TOOLS AND DOING THINGS IN A MORE STRUCTURED WAY, WE HOPE DAQCORD WILL HELP. THERE IS OUTCOME PREDICTION, NOTHING SPECIAL ABOUT THIS PAPER, BUT THE GITHUB REPOSITORY, READ HOW WE PREPARED OUR DATA, AND HERE YOU CAN EVEN LOOK AT THE DOI TO LOOK AT EVERYTHING WE DID. THIS IS A MUCH, MUCH BIGGER PROBLEM THAN JUST CDEs. FIND OTHER WAYS AROUND THIS, YES, THERE WILL ALWAYS BE MISSING DATA, SMART IMPUTATION METHODS CAN DEAL WITH WHAT'S LEFT OVER. CAN WE DEAL WITH DIRTY DATA?
CLEVER MODELS, THIS TAKES 13,000 UNSUPERVISED COMPLETELY UNCLEAN DATA AND BUILDS GOOD PREDICTION MODEL. BUT THESE SORTS OF THINGS WILL NEVER, EVER, EVER BE A SUBSTITUTE FOR HAVING DECENT DATA QUALITY, PARTICULARLY FOR THE SCIENTIFIC DOMAIN. THIS WORKS IN HEALTH RECORDS, NOT SCIENTIFIC JOURNALS AWAY. VERY IMPORTANT TO HAVE DECENT DATA QUALITY BUILT IN FROM THE START. IF I LEARNED ONE THING FROM THE DATA CLEANING ENTERPRISE, I NEVER, EVER, EVER WANT TO HAVE TO EMBARK ON A LARGE-SCALE RETROSPECTIVE DATA CLEANING AND DATA CURATION PROCESS. IT SHOULD BE BUILT IN FROM THE START, IT WILL SAVE TIME IN THE FUTURE. IN SUMMARY, THIS IS HARD, BUT IT'S REALLY, REALLY IMPORTANT. THERE ARE METHODOLOGIES OUT THERE, UNFAMILIAR TO MANY OF US IN BIOMEDICAL SCIENCES, WE HOPE DAQCORD WILL HELP WITH DESIGN QUALITY FROM THE START. THANK YOU. [APPLAUSE]