In healthcare, information technology, financial services, or any other high-stakes profession, assessments usually fail because critical stages in development were skipped or glossed over.
Failing an exam may prevent someone from advancing in their career as a physician, chiropractor, engineer, pharmacist, financial planner, or network administrator, and the same is true for many other careers. Consequently, skipping steps in development and validation puts the testing body at extreme risk, not to mention the consequences for individuals when exams are biased or fail to measure competence accurately.
A common benchmark for exam validity is an exam’s “defensibility,” meaning that results would stand up to a legal challenge. Meazure Learning designs, validates, and deploys these kinds of high-stakes credentialing assessments, and has published seven detailed white papers about creating highly valid, reliable, and defensible exams. I’ll briefly touch on the seven stages below; for more information, download the seven papers.
Exams generally follow a standard 7-stage development cycle
When an assessment has substantial evidence of validity, as all high-stakes exams should, it means that if any individual or organization challenges the exam in a court of law, a defense based on evidence that the exam accurately and fairly measures what it claims to measure would hold. To ensure this, development follows seven stages (Figure 1).
Figure 1: The Seven Stages of Assessment, a series of white papers by Meazure Learning, explains the full lifecycle of developing assessments that are valid, reliable, and defensible
1. Define target and create test specifications
2. Develop items meeting those specifications
3. Assemble forms
4. Administer exams
5. Analyze results
6. Score and report
7. Set the standard to pass the exam
Obviously, there is no room for error in the scoring phase of an assessment. However, when the results fail to be defensible, problems in an earlier stage may be to blame. Here are some of the most common mistakes.
Stage one may be the most important because it lays the foundation. A thorough and consultative research project should clearly delineate what will be assessed and how. A common pitfall is either to skip this step entirely or to neglect repeating it when significant changes occur in the industry.
Once this foundation is set, stage two is where high-quality questions are written. Given the technical nature of most high-stakes assessments, this involves subject matter experts (with technical knowledge) partnering with assessment developers (with psychometric knowledge). A diverse pool of item writers providing broad technical and cultural perspectives is vital here.
In stage three, the exam authority assembles the questions into a cohesive whole. The primary challenge is building an exam capturing the themes, elements, and proficiency goals that accurately measure what needs to be measured. Multiple iterations of the assembling, reviewing, and reassembling process may be required.
When it is time to administer the exam (stage four), the experience must 1) give candidates the best opportunity to demonstrate competence and 2) include stringent security measures. Each high-stakes exam question typically costs $1,000 or more to produce, meaning a 100-question exam could be a $100,000 investment, so theft must be prevented. Weak security also carries the risk of incompetent practitioners being given a license or certification, as well as the potential for severe reputational damage to the credentialing organization.
In stage five (item and test analysis), evidence is collected that the individual questions and the entire exam performed as intended. Questions are analyzed to ensure a proper level of difficulty and differentiation between competent and not-yet-competent candidates. Assessments must show evidence of validity and reliability. Only then can a credentialing body have confidence that it is passing and failing the correct people.
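To make the idea of difficulty and differentiation more concrete, here is a minimal sketch of a classical item analysis. It is not taken from the white papers: the function names, the tiny response matrix, and the flagging thresholds are all hypothetical, and a real analysis would use far more candidates and a psychometrician's judgment. Difficulty is computed as the proportion of candidates answering correctly, and discrimination as the correlation between an item and the candidate's score on the rest of the exam.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either variable has no variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def analyze_items(responses):
    """responses[c][i] is 1 if candidate c answered item i correctly, else 0."""
    totals = [sum(row) for row in responses]
    results = []
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        # Score on the rest of the exam, so the item is not correlated with itself.
        rest = [total - score for total, score in zip(totals, item)]
        difficulty = mean(item)
        discrimination = pearson(item, rest)
        # Illustrative thresholds only (assumed, not an industry rule).
        flagged = difficulty < 0.2 or difficulty > 0.95 or discrimination < 0.1
        results.append({
            "item": i,
            "difficulty": difficulty,
            "discrimination": discrimination,
            "flagged": flagged,
        })
    return results


# Hypothetical data: six candidates, four items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
]
results = analyze_items(responses)
for r in results:
    print(r)
```

In this made-up data, the last item is answered correctly mostly by candidates who did poorly on the rest of the exam, so its discrimination is negative and it is flagged for review. That is the kind of question a credentialing body would want to examine before relying on the results.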
In stage six, candidates are scored and reports are created, which requires 100% accuracy. Many organizations use double-scoring, whereby candidates are scored twice independently. Once scores are generated, the reports detailing candidate and group performance must be clear, concise, and digestible by the intended recipients.
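As a rough illustration of double-scoring, the sketch below compares the output of two independent scoring passes and lists every candidate whose results disagree. The function name, candidate IDs, and scores are hypothetical; how the two passes are produced (separate staff, separate software, or both) is left to the organization.

```python
def find_discrepancies(first_pass, second_pass):
    """Compare two independent scoring passes keyed by candidate ID.

    Returns {candidate_id: (first_score, second_score)} for every candidate
    whose scores differ, using None when a candidate is missing from a pass.
    """
    discrepancies = {}
    for candidate_id in sorted(set(first_pass) | set(second_pass)):
        first = first_pass.get(candidate_id)
        second = second_pass.get(candidate_id)
        if first != second:
            discrepancies[candidate_id] = (first, second)
    return discrepancies


# Hypothetical results from two independent scoring runs.
first_pass = {"C001": 78, "C002": 64, "C003": 91}
second_pass = {"C001": 78, "C002": 46, "C003": 91, "C004": 70}

discrepancies = find_discrepancies(first_pass, second_pass)
for candidate_id, (first, second) in discrepancies.items():
    print(f"{candidate_id}: first pass = {first}, second pass = {second}")
```

Results would be released only once this comparison comes back empty. Any entry, whether a transposed digit or a candidate missing from one pass, is resolved by going back to the original responses rather than by picking one of the two scores.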
In stage seven, a “passmark” is set and finalized. Contrary to popular belief, passmarks for high-stakes exams should not be policy-based (for example, a pre-determined passmark of 65%) or norm-based (“bell curving,” where a pre-determined number of candidates fail). The passmark should be set using a rigorous criterion-based approach that will withstand future challenges.
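This article does not name a specific criterion-based method, so the sketch below uses the Angoff procedure purely as one well-known example. In that approach, each subject matter expert estimates, item by item, the probability that a minimally competent candidate would answer correctly, and the passmark is the sum of the average ratings. The function name and the ratings are hypothetical, and real standard-setting involves training, discussion rounds, and review that a few lines of code cannot capture.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """ratings[j][i] is judge j's estimated probability that a minimally
    competent candidate answers item i correctly.

    Returns the raw passmark: the sum over items of the mean rating.
    """
    n_items = len(ratings[0])
    item_means = [mean(judge[i] for judge in ratings) for i in range(n_items)]
    return sum(item_means)


# Hypothetical ratings: three judges, four items.
ratings = [
    [0.6, 0.8, 0.5, 0.9],
    [0.7, 0.9, 0.4, 0.8],
    [0.5, 0.7, 0.6, 0.7],
]

cut_score = angoff_cut_score(ratings)
n_items = len(ratings[0])
print(f"Passmark: {cut_score:.2f} of {n_items} items ({cut_score / n_items:.1%})")
```

The point of the contrast is that the passmark here comes out of judgments about the exam's actual content, not from a number chosen in advance or from how a particular group of candidates happened to perform.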
While many of these trouble spots may seem easy to overcome, in practice they can be quite difficult to solve.