Keywords: validity and reliability in research
Measurement is important in research. It aims to ascertain the dimension, quantity, or capacity of the behaviors or events that researchers want to explore. According to Maxim (1999), measurement is a process of mapping empirical phenomena using a system of numbers.
Basically, the events or phenomena that researchers are interested in can be regarded as a domain. Measurement links the events in a domain to events in another space called the range (Figure 1). In other words, researchers can measure certain events using the range. The numbers compose a scale. Thus, researchers can interpret the data and draw quantitative conclusions, which brings about more exact and standardized results. Without measurement, researchers cannot interpret their data accurately and systematically.
Quantitative measurement is a numerical description of events or characteristics. For example, the description "There are three birds in the nest" includes a numerical measurement of the birds. Quantitative measurement allows researchers to make comparisons between events or characteristics. For example, researchers may want to know who the tallest person in a family is, so they use centimeters to measure each member's height and make comparisons among all the family members.
Levels of Measurement
Level of measurement refers to the amount of information that a variable provides about the phenomenon being measured (McClendon, 2004). All variables should have exhaustive attributes and mutually exclusive attributes. These two characteristics are related to the accuracy and precision of measurement in a study.
Exhaustive and mutually exclusive attributes. Exhaustive attributes cover the complete range of traits possessed by all of the subjects (people) in the study, so that every subject can choose an answer to each question. For example, a question asks all subjects in a study about their marital status with four options: (a) married; (b) divorced; (c) widowed; and (d) never married. However, a subject who is legally separated is unable to choose any of the options provided by the researchers. Thus, this question does not have exhaustive attributes. Categories that are not exhaustive can lead to missing data in the analysis and ultimately affect the results of the study.
Mutually exclusive attributes mean that researchers should be able to assign only one attribute to each person in a study. For instance, a question asks about the monthly income of each subject in the study with four options: (a) RM0-RM500; (b) RM500-RM1000; (c) RM1000-RM1500; and (d) RM1500-RM2000. These four options are not mutually exclusive because each category shares a boundary income with the next category. Subjects whose monthly income is RM1000 cannot tell whether to choose option (b) or (c).
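The fix for overlapping categories is to make the boundaries half-open, so every income falls in exactly one bracket and no income falls outside all of them. A minimal sketch in Python (the bracket labels and limits follow the hypothetical example above):

```python
# Sketch (hypothetical brackets): mutually exclusive, exhaustive income
# categories built from half-open ranges [low, high).
def assign_bracket(income):
    """Return the single bracket label for a monthly income in RM."""
    brackets = [
        ("a", 0, 500),      # RM0 - RM499
        ("b", 500, 1000),   # RM500 - RM999
        ("c", 1000, 1500),  # RM1000 - RM1499
        ("d", 1500, 2001),  # RM1500 - RM2000 (upper bound inclusive)
    ]
    matches = [label for label, low, high in brackets if low <= income < high]
    # Exactly one match means the categories neither overlap nor leave gaps.
    assert len(matches) == 1, "categories must not overlap or leave gaps"
    return matches[0]

print(assign_bracket(1000))  # RM1000 now falls in exactly one category: c
```

With half-open ranges, the boundary income RM1000 belongs unambiguously to the third category instead of two at once.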
Nominal measurement. There are four levels of measurement: nominal, ordinal, interval, and ratio. Nominal measurement is a process of assigning numerals to categories. In other words, nominal means in name or form only (McClendon, 2004). Researchers cannot describe or compare the cases or events using adjectives such as "higher than", "lower than", "more than", or "less than". Thus, it is the lowest level of measurement. Percentages, the mode as a measure of central tendency, and the chi-square test are appropriate at this level of measurement.
One experiment in social psychology, conducted by Stanley Schachter in 1957, aimed to measure affiliation when people feel anxious about a situation. The experimenter informed the subjects that they would be assigned to one of two conditions: intense electrical shock or mild electrical shock. Then, the experimenter asked whether they preferred to wait together with others or alone, and which condition they would like to take part in. The results showed that subjects in the intense-shock group preferred to wait together with others, while subjects in the mild-shock group had no preference. The experimenter measured the subjects' responses using nominal measurement: (1) prefer mild shock; (2) prefer intense shock; (3) prefer to wait together with others; and (4) no preference about waiting together with others. The conclusion drawn by the experimenter was based on nominal measurement.
Ordinal measurement. Ordinal measurement allows researchers to make comparisons such as "greater than" and "less than", but not "how much greater" (McClendon, 2004). For instance, based on one of the examples in Table 1, researchers can conclude that people who strongly agree with legalizing abortion rank higher on agreement than people who disagree with it.
Besides that, attributes can also be rank-ordered. The academic rank of students, such as freshman, junior, and senior, is an example of this level of measurement. We cannot use arithmetic operations at this level of measurement because the distances or intervals between the attributes are unknown.
Some relations involved in this level of measurement are precedence and preference (Maxim, 1999). aPb, bPc, and aPc indicate that a precedes b, b precedes c, and a precedes c. These relations express comparisons such as "higher than" and "greater than". For instance, four students' results are ranked by their marks: 97, 81, 79, and 70. If 97 is "a", 81 is "b", 79 is "c", and 70 is "d", we can conclude that aPb, aPc, aPd, bPc, bPd, and cPd.
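The precedence pairs above can be derived mechanically from the marks; a short sketch (same hypothetical students):

```python
# Sketch: deriving every precedence pair xPy from a set of marks.
# Labels and marks follow the example above (hypothetical students).
from itertools import combinations

marks = {"a": 97, "b": 81, "c": 79, "d": 70}

# Sorting descending by mark puts higher-ranked students first, so every
# pair taken in that order is a valid precedence statement, and the
# relation is transitive by construction.
ranking = sorted(marks, key=marks.get, reverse=True)
pairs = [f"{x}P{y}" for x, y in combinations(ranking, 2)]
print(pairs)  # ['aPb', 'aPc', 'aPd', 'bPc', 'bPd', 'cPd']
```

Note that only the order of the marks matters here; the sizes of the gaps (97 to 81 versus 81 to 79) play no role, which is exactly what makes this ordinal rather than interval measurement.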
Interval measurement. The characteristics of this level of measurement are that the attributes are ordered and the distances between attributes are equal. However, it does not have a true zero point. The Fahrenheit and Celsius temperature scales are often used as examples of this level of measurement. They do not have a true zero point because zero degrees does not mean "no heat".
At this level of measurement, researchers can state that the difference between 40 and 50 degrees is the same as the difference between 80 and 90 degrees, since the distances between the categories are equal. However, they cannot state that 80 degrees is twice as hot as 40 degrees, because the scale does not have a true zero point. For example, researchers cannot say that the temperature in city A is twice as hot as in city B. As Table 1 shows, 80 degrees Fahrenheit is the same as about 27 degrees Celsius, and 40 degrees Fahrenheit is the same as about 4 degrees Celsius, and 27 degrees is not double 4 degrees.
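The point can be checked directly: converting the same two readings from Fahrenheit to Celsius changes their ratio, so ratio statements on an interval scale are not meaningful. A small sketch:

```python
# Sketch: ratios are not preserved on interval scales, because the zero
# point is arbitrary. Standard Fahrenheit-to-Celsius conversion.
def f_to_c(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

print(round(f_to_c(80), 1))               # 26.7
print(round(f_to_c(40), 1))               # 4.4
print(80 / 40)                            # 2.0 on the Fahrenheit scale...
print(round(f_to_c(80) / f_to_c(40), 1))  # ...but 6.0 on the Celsius scale
```

If temperature ratios were meaningful, both scales would give the same ratio; since they do not, "twice as hot" has no fixed meaning on either scale.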
Conceptualization and Operationalization in Measurement
Conceptualization is a process of specifying a term or concept that researchers want to measure. In deductive research, it helps researchers to specify the theory and come up with a specific variable that can be placed in a hypothesis. In inductive research, it gives researchers an idea about which related behaviors or events need to be observed.
For example, researchers want to measure the influence of social status on academic performance among children. First, the researchers should specify what "social status" is. Through conceptualization, social status can be defined as power, prestige, and privilege. Another example is deviant behavior. Researchers want to study the relationship between deviant behavior and academic performance. First, they have to understand what deviant behavior means. Thus, they define it as acts such as smoking, fighting, underage drinking, and intimidation. Here, a concept is defined without using any quantitative methods.
Operationalization is defined as a process of defining a concept by measuring it (Maxim, 1999). Operationalization specifies how a concept in a study will be measured. Scoring, coding, and scaling may be used to operationalize a concept. For example, researchers want to study the relationship between students' happiness and school performance. School performance can include several indicators, such as students' examination results, on-time submission of assignments, and attendance. In the study, the researchers aim to focus on these three indicators, so they begin to operationalize them.
They plan to measure students' examination results using the Cumulative Grade Point Average (CGPA), and they assume that students with high happiness will have a high CGPA in university. Then, they measure "on-time submission of assignments" by the frequency with which students submit assignments on time in a month, and assume that a high level of happiness is associated with a high frequency of on-time submission. Lastly, they measure students' attendance by the percentage of classes attended in a month, and assume that a higher level of happiness is associated with a higher percentage of class attendance.
Difference between Conceptualization and Operationalization in Measurement
Conceptualization is a process of defining a concept without applying any quantitative or other methods that can show the values of a variable. Conceptualization only lets researchers and the public understand what a term or concept means. Operationalization defines a term or concept and also applies methods, especially quantitative ones, to indicate the values of the variable.
Indexes and Scales
Indexes and scales are measuring tools or devices. Both are used to measure variables or concepts that researchers are interested in. A scale is a cluster of items arranged into a unitary dimension or single domain of behavior, attitudes, or feelings. Scales are more specific than indexes. Scales can predict outcomes such as actions, attitudes, and feelings because they measure the underlying attributes. For instance, a scale measures a more specific variable such as introversion. Thus, an introversion scale should consist of items related to introversion only, such as: (1) I blush easily; (2) At parties, I tend to be a wallflower; (3) Staying home every night is all right with me; (4) I prefer small gatherings to large gatherings; and (5) When the telephone rings, I usually let it ring at least once or twice. A Likert scale such as "Strongly Agree-Strongly Disagree" is used. As another example, the Hare Self-Esteem Scale includes three specific sub-scales: the Peer Self-Esteem Scale, the Home Self-Esteem Scale, and the School Self-Esteem Scale. These three sub-scales are used to measure the concept of self-esteem.
An index, by contrast, is a set of items that covers multiple interrelated dimensions. These dimensions are combined into a single indicator or score. An index is more general than a scale. It is also designed for discovering the relevant causes or underlying symptoms of attributes. Indexes tend to measure a concept based on what happens in real life.
An index might measure the life satisfaction of university students. Because life satisfaction may comprise many dimensions or categories, the index should include items related to all of them. For example, life satisfaction could include satisfaction with one's job, family relationships, peer relationships, and marital relationship. Researchers total up the scores of all items, and the total score reflects the level of life satisfaction.
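Scoring such an index is simply a matter of summing the item scores; a minimal sketch with hypothetical ratings on a five-point satisfaction scale:

```python
# Sketch (hypothetical items and responses): an index combines scores
# from several dimensions into one total indicator.
item_scores = {
    "job satisfaction": 4,
    "family relationship": 5,
    "peer relationship": 3,
    "marital relationship": 4,
}  # each item rated 1 (very dissatisfied) to 5 (very satisfied)

life_satisfaction_index = sum(item_scores.values())
print(life_satisfaction_index)  # 16 out of a possible 20
```

The single total is what makes this an index: the items tap different dimensions of life, yet they are collapsed into one score meant to reflect overall life satisfaction.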
Reliability is important because it gives researchers some assurance that the measures they take are close to the true measure. Validity is important because it tells researchers that the measure they have taken really measures what they hope it does. So, if researchers wish to know how good a measurement is, they should examine the reliability and validity of the measurement.
Reliability is a synonym for repeatability and consistency. Reliability is defined as the degree to which test scores are free from errors of measurement (AERA et al., 1999, p. 180, in Neukrug & Fawcett, 2006). The degree of reliability determines whether the scores or data that researchers obtain can be relied on to measure a variable or construct.
Measurement error. An unreliable measurement is caused by error sources of variability. There are two types of error: systematic measurement error and unsystematic measurement error. Systematic measurement error comes from factors that affect measurement consistently over time. It is predictable and can be removed once it is detected. It is also related to the validity of a measurement. Systematic measurement error occurs when, unknown to the test developer, a test measures something other than the trait that researchers intend to measure. This can seriously affect the validity of a test.
Unsystematic measurement error consists of effects or mistakes that are unpredictable and inconsistent. It is related to the reliability of a measurement. Item selection, test administration, and test scoring are sources of unsystematic measurement error.
Item selection error occurs in the instrument itself. Examples include an instrument that contains invalid questions or items, items that are unfair to some respondents even though the instrument is already considered good, and a test that has too many items. Test administration error includes an uncomfortable room, dim lighting, noise in the room, fatigue, nervousness, and other conditions that may affect respondents' performance. Test scoring error occurs when the test does not use machine-scored multiple-choice items; subjective judgment in scoring arises especially in projective tests and essay questions. The Rorschach Inkblot Test, the Word Completion Test, and the Thematic Apperception Test involve subjective judgment.
Types of reliability. There are two major types of reliability: reliability as temporal stability and reliability as internal consistency. Reliability as temporal stability is related to the occasions on which data are obtained, and includes test-retest and alternate-forms reliability. Internal consistency includes split-half, coefficient alpha, and interscorer reliability.
Test-retest reliability is defined as the correlation between scores from one test given at two different administrations (Neukrug & Fawcett, 2006). Alternate-forms reliability is the correlation between the scores from two versions of the same test. For this kind of reliability, everything in the two versions, such as the difficulty level, number of items, and content, should be the same. Split-half reliability is defined by correlating one half of the test with the other half. Researchers can divide the test into two parts, the first half and the second half, or they can divide the items into odd-numbered and even-numbered items. The Spearman-Brown formula is used because each half contains only a few items, and the correlation between two half-length tests underestimates the reliability of the full-length test.
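The split-half estimate with the Spearman-Brown correction, r_full = 2r / (1 + r), can be sketched as follows (the respondents' half-test scores are hypothetical):

```python
# Sketch (hypothetical scores): split-half reliability with the
# Spearman-Brown correction, r_full = 2r / (1 + r).
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Each respondent's total on the odd-numbered and even-numbered items.
odd_half = [10, 14, 9, 16, 12]
even_half = [11, 13, 10, 15, 13]

r_half = pearson(odd_half, even_half)        # correlation of the two halves
r_full = 2 * r_half / (1 + r_half)           # Spearman-Brown correction
print(round(r_full, 2))  # 0.98
```

The corrected value is always at least as large as the raw half-test correlation, reflecting the fact that a longer test is more reliable than either of its halves.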
Coefficient alpha and the Kuder-Richardson formula are based on correlating the score of every item with the total score on the test. The Kuder-Richardson formula is used when the items are answered with "yes" or "no". Interscorer reliability is defined by correlating the scores from two or more observers rating the same event. Observers should be trained to rate the events or behaviors of respondents.
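Coefficient alpha is commonly computed from the item variances and the variance of the total scores, as alpha = k/(k-1) * (1 - sum of item variances / variance of totals); a sketch with hypothetical item responses:

```python
# Sketch (hypothetical responses): coefficient alpha computed as
# alpha = k/(k-1) * (1 - sum of item variances / variance of totals).
from statistics import pvariance

# Rows are respondents, columns are the k = 3 items (rated 1-5).
responses = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 3, 3],
    [4, 4, 5],
]

k = len(responses[0])
items = list(zip(*responses))             # one tuple of scores per item
totals = [sum(row) for row in responses]  # each respondent's total score

item_var_sum = sum(pvariance(item) for item in items)
alpha = k / (k - 1) * (1 - item_var_sum / pvariance(totals))
print(round(alpha, 2))  # 0.92
```

When the items hang together (respondents who score high on one item score high on the others), the variance of the totals is large relative to the summed item variances and alpha approaches 1.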
Test-retest is appropriate when researchers want to measure the behavior of respondents across time. Coefficient alpha is suitable for tests that measure a single dimension. Splitting the test into odd-numbered and even-numbered items is appropriate when the difficulty of the items has been carefully ordered. If the difficulty of the items is not carefully ordered, splitting the test into the first half and the second half is suitable. Interscorer reliability is used when scoring the test involves subjectivity.
Validity refers to the accuracy of a measure. A measurement is valid when it measures what the researchers intend it to measure (Gregory, 2007). For example, IQ tests are supposed to measure intelligence, and depression tests are supposed to measure the depression level or symptoms of respondents. Generally, the inferences drawn from a valid test are appropriate, meaningful, and useful.
Types of validity. There are three types of validity: content validity, criterion validity, and construct validity. Criterion validity includes predictive validity and concurrent validity. Construct validity includes convergent and discriminant validity.
Content validity is determined by the degree to which the questions, tasks, or items on a test are representative of the universe of behavior the test was designed to sample (Gregory, 2007). The appropriateness of the content of a measurement is determined by experts, who judge whether the items in a measurement cover all the domains the researchers want to measure. For example, a teacher wants to create a test to measure students' knowledge of a subject from chapters 1 to 5. The type and number of questions are designed: sixty multiple-choice questions, with 60 minutes given to the students to complete the test. Ten questions cover each chapter, and the remaining questions cover chapter five, which is considered the most important chapter in the test.
Content validity can also be established by experts' ratings of each item to decide whether the item reflects the content or not. Two experts rate each item on a four-point scale. The rating of each expert on each item can be dichotomized into weak relevance of content (ratings of 1 and 2) and strong relevance of content (ratings of 3 and 4). If both experts agree that an item is highly relevant, the item is placed in cell D; if both experts agree that the item has weak relevance, it is placed in cell A. Cells B and C contain the items rated as relevant by one expert but not by the other (Figure 2).
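This dichotomize-and-place procedure can be sketched as follows (the ratings are hypothetical, and which disagreement cell counts as B versus C is an assumption here, since Figure 2 fixes that layout):

```python
# Sketch (hypothetical ratings): dichotomizing two experts' four-point
# ratings and placing each item into agreement cells A-D as above.
def content_cell(rating1, rating2):
    """Cell A: both weak (1-2); D: both strong (3-4); B/C: disagreement."""
    strong1, strong2 = rating1 >= 3, rating2 >= 3
    if strong1 and strong2:
        return "D"
    if not strong1 and not strong2:
        return "A"
    # Assumed layout: B when only expert 2 rates strong, C when only expert 1.
    return "B" if strong2 else "C"

ratings = {"item1": (4, 3), "item2": (1, 2), "item3": (2, 4), "item4": (3, 1)}
cells = {item: content_cell(r1, r2) for item, (r1, r2) in ratings.items()}
print(cells)  # {'item1': 'D', 'item2': 'A', 'item3': 'B', 'item4': 'C'}
```

Items landing in cell D are the ones with strong evidence of content relevance; items in cells B and C signal that the experts need to discuss or revise the item.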
For criterion validity, both predictive and concurrent validity are established by comparing test scores with another criterion. Concurrent validity correlates test scores with criterion scores obtained at the same time. For example, researchers want to measure the reading ability of students using a Reading Achievement Test. The researchers compare the students' Reading Achievement Test scores with the teachers' ratings of the students' reading skill. A high correlation between the two sets of scores shows that the test has high concurrent validity.
Predictive validity correlates test scores with criterion scores obtained in the future, which means the two sets of data are obtained at different times. For instance, a work test is used to predict the performance of workers in a firm or organization. First, researchers give the test to the staff; after six months, the supervisors are asked to evaluate the staff's performance. Then, the researchers compare the test scores with the supervisors' ratings to see the level of validity. The difference between concurrent and predictive validity is the time frame used to obtain the data and scores.
For construct validity, a construct is a theoretical, intangible quality or trait in which individuals differ. It is abstract and hard to measure, so it needs some indicators or signs to represent it. A construct is a collection of related behaviors that can represent the thing that researchers want to measure. Construct validity is evidence that an idea or concept is being measured by a test (Neukrug & Fawcett, 2006).
For example, depression is a construct, and it is manifested by behaviors such as lethargy, difficulty concentrating, and lack of appetite. Homogeneity refers to a test measuring a single construct; a homogeneous test consists of a single element or subtest. The purpose of homogeneity is to select items that are able to form a homogeneous scale.
Convergent validity is defined as a test correlating highly with other measures of the same or overlapping constructs. For example, researchers may take the Beck Depression Inventory-II (BDI-II) and compare it with other tests measuring the same variables. The results show that the BDI-II correlates with the Scale for Suicide Ideation (r = .37); the Beck Hopelessness Scale (r = .68); the Hamilton Psychiatric Rating Scale for Depression (r = .71); and the Hamilton Rating Scale for Anxiety (r = .47). Lastly, discriminant validity means that a test does not correlate with variables or tests that measure different constructs.
Relationship between Reliability and Validity
Good validity requires good reliability to be established first. However, good reliability does not lead to good validity. Good reliability only reflects that the scores on a measurement appear consistently.
Good validity, however, implies reliability: when a measurement or test measures what the researchers intend to measure, validity is present and thus reliability is present as well. In a test, reliability is necessary but not sufficient for validity. In other words, a measure can be reliable but not valid; a valid measure, however, must be reliable.