TESOL Quarterly invites readers to submit short reports and updates on their work.
These summaries may address any areas of interest to TQ readers.
Edited by ALI SHEHADEH
United Arab Emirates University
Iowa State University
Diagnosing the Support Needs of Second Language
Writers: Does the Time Allowance Matter?
CATHIE ELDER
University of Melbourne
Carlton, Victoria, Australia
UTE KNOCH
University of Melbourne
Carlton, Victoria, Australia
RONGHUI ZHANG
Shenzhen Polytechnic Institute
Shenzhen, China
This study investigates the impact of changing the time allowance
for the writing component of a diagnostic English language assessment
administered on a voluntary basis to first-year undergraduates at two universities with large populations of immigrant and international students
following their admission to the university. The test is diagnostic in the
sense of identifying areas where students may have difficulties and therefore benefit from targeted English language intervention concurrently
with their academic studies. A change in the time allocation for the writing component of this assessment (from 55 to 30 minutes) was introduced
in 2006 for practical reasons. It was believed by those responsible for
implementing the assessment that a reduced time frame would minimize
the problems associated with scheduling the test and accordingly encourage faculties to adopt the assessment tool as a means of identifying their
students’ language learning needs. The current study aims to explore
TESOL QUARTERLY Vol. 43, No. 2, June 2009
how the shorter time allowance would influence the validity, reliability,
and overall fairness of an EAP writing assessment as a diagnostic tool.
The impetus for the study arose from anecdotal reports from test
raters to the effect that, under the new time limits, students were either
planning inadequately in preparation for the task or else failing to meet
the word requirements. The absence of planning time was perceived
to have a negative impact on the quality of students’ written discourse.
Concerns were also expressed that the limited nature of the writing
sample made it difficult to provide an accurate and reliable assessment
of students’ ability to cope with the writing demands of the academic
context.
As discussed in Weigle (2002), the time allowed for test administration
raises issues of authenticity, validity, reliability, and practicality. Most academic writing tasks in the real world are not generally performed under
time limits, and academic essays usually require reflection, planning, and
multiple revisions. A writing task within a reduced time frame without
access to dictionaries and source materials will inevitably be inauthentic
in the sense that it fails to replicate the conditions under which academic
writing is normally performed. Moreover, unless a test task is designed
expressly to measure the speed at which test takers can answer the question posed, rigid time limits potentially threaten the validity of score
inferences about test takers’ writing ability. The limited amount of writing produced under time pressure may also make it difficult for raters to
accurately assess the writer’s competence. On the other hand, institutional constraints are inevitable on the resources available for any assessment. A timed essay test is certainly easier and more economical to
administer, and it can be argued that even a limited sample of writing
elicited under less than optimal conditions may be better than no assessment at all as a means of flagging potential writing needs. Achieving a
balance between what is optimal in terms of validity, authenticity, and reliability and what is institutionally feasible is clearly important in any test
design.
Research investigating the time variable in writing assessment has
produced somewhat contradictory findings, perhaps because of the different tasks, participants, contexts, and methodologies involved and
also the differing time allocations investigated. Some studies suggest
that allowing more time results in improved writing performance
(Biola, 1982; Crone, Wright, & Baron, 1993; Livingston, 1987; Younkin,
1986; Powers & Fowles, 1996), whereas others find that changing the
time allowance makes no difference to performance as far as rater reliability and/or rank ordering of students is concerned (Caudery, 1990;
Hale, 1992; Kroll, 1990). Not all studies use independent ability measures (such as test scores from a different language test) or a counterbalanced design that controls for extraneous effects such as task
difficulty and order of presentation (but see Powers & Fowles, 1996).
Investigative methods also differ, with most studies looking only at mean
score differences across tasks without considering the validity implications of any differences in the relative standing of learners when the
time variable is manipulated (but see Hale, 1992). Moreover, most studies have focused on overall scores, based on a holistic scoring or performance aggregates, rather than exploring whether the time condition
has a variable impact on different dimensions of performance, such as
fluency and accuracy (but see Caudery, 1990). It is particularly important to consider these different dimensions when one is dealing with
assessment for diagnostic purposes, where the prime function of the
test score is to provide feedback to teachers and learners about future
learning needs. If changing the time allocation influences the nature of
the information yielded about particular dimensions of writing ability, this
result may have important validity implications as well as practical
consequences.
This study aims to establish whether altering the time conditions on
an academic writing test has an effect on (a) the analytic and overall
(average) scores raters assigned to students’ writing performance and
(b) the level of interrater reliability of the test. If scores differ according
to time condition, this result would have implications for who is identified as needing language support, and if consistent rating is harder to
achieve under one or another condition, then decisions made about
individual candidates’ ability cannot be relied on. Thirty students
each completed two writing tasks aimed at diagnosing their language
support needs. For one of these tasks they were given a maximum of
30 minutes of writing time and for the other they were given 55 minutes.
A fully counterbalanced design was chosen to control for task version
and order effect.
The study investigated the following research questions:
1. Do students’ scores on the various dimensions of writing ability
differ between the long (55-minute) and short (30-minute) time
condition?
2. Are raters’ judgments of these dimensions of writing ability equally
reliable under each time condition?
Context of the Research
The preliminary study reported in this article was conducted in the
context of a diagnostic assessment administered in very similar forms at
both the University of Melbourne and the University of Auckland. The
assessment serves to identify the English language needs of undergraduate students following their admission to one or the other university and
to guide them to the appropriate language support offered on campus.
The Diagnostic English Language (Needs) Assessment or DELA/DELNA
(the name of the testing procedure differs at each university) is a general
rather than discipline-specific measure of academic English. The writing
subtest, which is the focus of this study, is described in more detail in the
Instruments section. The data for the current study were collected at the
University of Auckland and analysed at the University of Melbourne.
Participants
The participants in the study were 30 first-year undergraduate students
at the University of Auckland, ranging in age from 20 to 39 years.
The group comprised 19 females and 11 males. All participants were
English as an additional language (EAL) students from a range of L1
backgrounds, broadly reflecting the diversity of the EAL student population at the University of Auckland. The majority (64%) were Chinese
speakers, while other L1 backgrounds included French, Malay, Korean,
German, and Hindi. The mean length of residence in New Zealand was
Raters
Two experienced DELNA raters were recruited to rate the essays collected for the study. DELNA raters are regularly trained and monitored
(see, e.g., Elder, Barkhuizen, Knoch, & von Randow, 2007; Elder, Knoch,
Barkhuizen, & von Randow, 2005; Knoch, Read, & von Randow, 2007).
Both raters had postgraduate qualifications in TESOL as well as rating
experience in other assessment contexts (e.g., International English
Language Testing System).
Instruments
To achieve a counterbalanced design, two prompts were chosen for
the study. The topics of the essays were as follows:
Version A: Every citizen has a duty to do some sort of voluntary work.
Version B: Should intellectually gifted children be given special assistance in schools?
The task required students to write an argument essay of approximately 300 words in response to these questions. To help them formulate
the content of the essays, students were provided with a number of
brief supporting or opposing statements, although they were asked not to
include the exact wording of these statements in their essays.
To ascertain that the two prompts used were of similar difficulty, overall ratings were compared across the 60 essays. An independent samples
t test showed that the two prompts were statistically equivalent with respect
to the challenge they presented to test takers, t(58) = 0.415, p = 0.680.
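The prompt-equivalence check reported above was run in SPSS; purely as an illustration of the procedure, a minimal Python/scipy sketch (with invented band scores, not the study's data) might look like this:

```python
# Sketch of the prompt-equivalence check: an independent samples t test
# comparing overall ratings across the two prompt versions. The ratings
# below are invented for illustration; the study's analysis used SPSS.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
version_a = rng.normal(3.8, 0.6, 30).clip(1, 6)   # hypothetical band scores (1-6)
version_b = rng.normal(3.85, 0.6, 30).clip(1, 6)

t, p = stats.ttest_ind(version_a, version_b)
df = len(version_a) + len(version_b) - 2
print(f"t({df}) = {t:.3f}, p = {p:.3f}")
# A non-significant result (p > .05) gives no evidence of a
# difficulty difference between the prompts.
```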
The rating scale used was an analytic rating scale with three rating
categories (fluency, content, and form), each rated on six band levels ranging from 1 to 6, where a score of 4 or less indicates a need for English language support. Raters were asked to produce ratings for each of the three
categories. These ratings were also averaged to produce an overall score.
To obtain an independent measure of the students’ language ability,
the students first completed a screening test comprising a vocabulary and
speed-reading task (Elder & von Randow, in press). Based on these
scores, the students were divided into four groups of more or less equal
ability. Then, to control for prompt and order effects, a fully counterbalanced design was used, as outlined in Table 1.
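Because Table 1 is not reproduced here, the sketch below illustrates what a fully counterbalanced assignment of this kind typically looks like: four groups crossing prompt order with time-condition order, with students dealt into groups by screening rank. The group labels and round-robin allocation are illustrative assumptions, not the study's actual procedure.

```python
# Sketch of a fully counterbalanced design: four groups cross prompt order
# (A-first vs. B-first) with time-condition order (30-first vs. 55-first).
# Group labels and the allocation rule are illustrative assumptions.
from itertools import product

# (first task, second task) for each group: prompt order x time order.
designs = list(product([("A", "B"), ("B", "A")],
                       [(30, 55), (55, 30)]))

# 30 students, ranked by screening score, dealt round-robin into four
# groups of roughly equal ability.
students = [f"S{i:02d}" for i in range(1, 31)]
groups = {g: [] for g in range(4)}
for rank, student in enumerate(students):
    groups[rank % 4].append(student)

for g, ((p1, p2), (t1, t2)) in enumerate(designs):
    print(f"Group {g + 1}: first = prompt {p1} / {t1} min, "
          f"second = prompt {p2} / {t2} min, n = {len(groups[g])}")
```

Crossing both factors in this way means that any overall prompt-difficulty or order effect is distributed evenly across the two time conditions.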
The writing scripts were presented in random order to the raters, who
were given no information about the condition under which the writing
was produced, so as to eliminate the possibility of their taking the time
allowance into account when assigning the scores. Raters have been
found in other studies (e.g., McNamara & Lumley, 1997) to compensate
candidates for task conditions which they feel may have disadvantaged
them.
The scores produced by the two raters were entered into SPSS (2006).
T tests and correlational analyses were used to answer the two research
questions.
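The analyses themselves were run in SPSS; as a sketch of the same two procedures (a paired samples t test for mean differences, Spearman rho for rank-order stability), a scipy version with invented paired ratings could read:

```python
# Sketch of the two analyses: a paired samples t test (do mean scores
# differ across time conditions?) and a Spearman rho correlation (is the
# rank order of candidates preserved?). Ratings are invented for
# illustration; the study's analyses were run in SPSS.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
short_30 = rng.normal(3.7, 0.6, 30).clip(1, 6)               # same 30 students
long_55 = (short_30 + rng.normal(0.1, 0.4, 30)).clip(1, 6)   # under both conditions

t, p = stats.ttest_rel(short_30, long_55)
rho, p_rho = stats.spearmanr(short_30, long_55)
print(f"paired t = {t:.3f} (p = {p:.3f}); Spearman rho = {rho:.3f}")
```

The paired design is what licenses `ttest_rel` here: each student contributes a score under both time conditions, so the test compares within-student differences rather than two independent groups.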
Results

Research Question 1. Do students’ scores on the various dimensions of
writing ability differ between the long (55-minute) and short (30-minute) time condition?
Two different types of analyses were used to explore variation in students’ scores under the two time conditions. First, mean scores obtained
under each condition were compared (see Table 2). The means for form
and fluency were almost identical in each time condition, whereas for
content, the long writing task elicited ratings almost half a band higher
than those allocated to the short one.
Table 2. Paired Samples t Tests: average fluency, content, form, and total ratings under each time condition. Note. SD = standard deviation.
Table 3. Correlations of Scores Under the Short and Long Conditions. Note. All results significant at the 0.01 level (2-tailed).
Although mean scores for each of
the analytic criteria were consistently higher in the 55-minute condition,
a paired samples t test (Table 2) showed that none of these mean differences was statistically significant. Second, a Spearman rho correlation was
used to ascertain if the ranking of the candidates was different under the
two time conditions. Table 3 presents the correlations for the fluency,
content, and form scores under the two conditions as well as a correlation for the averaged overall score.
Although the correlations in Table 3 are all significant, they vary somewhat in strength. The average scores for writing produced under the
short and long time condition correlate more strongly than do the analytic scores assigned to particular writing features. The correlations are
lowest for the fluency criterion, although a Fisher r-to-z transformation
indicates that the size of this coefficient does not differ significantly from
the others.
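The Fisher r-to-z comparison can be computed directly from the two coefficients and their sample sizes; the sketch below uses illustrative values rather than the study's coefficients.

```python
# Sketch of a Fisher r-to-z test for the difference between two independent
# correlation coefficients (e.g., the fluency correlation vs. another
# criterion's). The r values and sample sizes are illustrative only.
import math

def fisher_z_diff(r1, n1, r2, n2):
    """Two-tailed p for H0: the two population correlations are equal."""
    z1 = math.atanh(r1)          # Fisher transformation of each r
    z2 = math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se           # standard normal test statistic
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

z, p = fisher_z_diff(0.60, 30, 0.80, 30)   # hypothetical coefficients, n = 30
print(f"z = {z:.2f}, p = {p:.3f}")
```

With samples of this size, even an apparent gap of 0.20 between coefficients typically fails to reach significance, which is consistent with the pattern reported above.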
Research Question 2. Are raters’ judgments of writing ability equally
reliable under each time allocation?
It was of further interest to determine if there were any differences in
the reliability of rater judgments under the two time conditions. Table 4
presents the correlations between the two raters under the two time
conditions. Although the correlation coefficients for the short and long
conditions were not significantly different from one another, Table 4
shows that correlations were consistently higher for the short time
condition.
Table 4. Correlations Between the Two Raters Under the Two Time Conditions. Note. All results significant at the 0.01 level (2-tailed).
Discussion
The current study’s purpose was to determine both the validity and practical implications of reducing the time allocation for the DELA/DELNA
writing test from 55 to 30 minutes. Mean score comparisons showed that
students performed very similarly across the two task conditions. Although
this result accords with those of writing researchers such as Kroll (1990),
Caudery (1990), and Powers and Fowles (1996), it is somewhat at odds with
Biola (1982), Crone et al. (1993), and Younkin (1986), who showed that
students performed significantly better when more time was given for
their writing. However, as already suggested in our review of the literature, the differences between these studies’ findings may be partly a function of sample size.
Worthy of note in our study is the greater discrepancy in means for content between the long and short writing conditions. The fact that test takers
scored marginally higher on this category under the 55-minute condition
is unsurprising, given that it affords more time for test takers to generate
ideas on the given topic. In general, however, the practical impact of the
score differences observable from this study is likely to be negligible.
One might argue that shortening the task will produce slightly depressed
means for the undergraduate population as a whole, with the result that
a marginally higher proportion of students receive a recommendation of
“needs support.” However, this is hardly of a magnitude that would create
significant strain on institutional resources and is, in any case, potentially of
benefit in terms of ensuring that a larger number of borderline students
are flagged, thereby gaining access to language support classes.
More important is the question of whether the writing construct changes
when the time allocation decreases, because this has implications for the
validity of inferences drawn about test scores. The cross-test correlational
statistics are not strong for any of the rating criteria, and this is particularly true for fluency, implying that opportunities to display coherence
and other aspects of writing fluency may differ under the two time conditions. These construct differences have potential implications for EAP
support teachers who may use DELA/DELNA writing profiles to determine how to focus their interventions. It cannot, however, be assumed
that the writing produced in the short time condition is a less valid indicator of candidates’ academic writing ability than writing produced within
the long time frame.
As for interrater reliability, the findings of this study revealed (as in
Hale’s 1992 study) that scoring consistency was acceptable and comparable across the two time conditions. In fact, the data reported here suggest that alignment between raters increases slightly in the short writing
condition on each of the writing criteria. Because this difference is not
statistically significant, it is not appropriate to speculate further about
possible reasons for this outcome, but this issue is certainly worth exploring further with a larger data set. In the meantime we can conclude that
shortening the writing task presents no disadvantage as far as reliability of
rating is concerned.
Conclusion
The issue investigated in this small-scale preliminary study certainly
invites further investigation, both with a larger sample and with methods
not yet applied in research on the impact of timing on writing performance. Procedures such as think-aloud verbal reports and discourse analysis could be used to get a better sense of any construct differences
resulting from the time variable than can be gleaned from a quantitative
analysis. If writing produced under the 55-minute condition were found
to show more of the known and researched characteristics of academic
discourse than that produced within the 30-minute condition, this result
would have important validity implications with regard to the diagnostic
capacity of the procedure and its usefulness for students, teaching staff,
and other stakeholders. A further issue, which is the subject of a subsequent investigation, is how test takers feel about doing the writing task
under more stringent time conditions. Although we have shown that
enforcing more stringent time conditions does not make a difference to
test scores, it may be perceived as unfair, making it less likely that students
will take their results seriously and act on the advice given. However, we
would caution that any decision based on these results will, as is the case
with any language testing endeavor, involve a trade-off between what is
feasible and what is desirable in the context of concern.
Acknowledgments
The authors thank Martin von Randow for assistance with aspects of the study design
and Janet von Randow and Jeremy Dumble for their efforts in administering the test
tasks and recruiting participants and raters for this study.
The Authors
Cathie Elder is director of the Language Testing Research Centre at the University of
Melbourne, in Carlton, Victoria, Australia. Her major research efforts and output
have been in the area of language assessment. She has a particular interest in issues
of fairness and bias in language testing and in the challenges posed by the assessment
of language proficiency for specific professional and academic purposes.
Ute Knoch is a research fellow at the Language Testing Research Centre, University
of Melbourne, in Carlton, Victoria, Australia. Her research interests are in the areas
of writing assessment, rating scale development, rater training, and assessing languages for specific purposes.
Ronghui Zhang is a lecturer in the Department of Foreign Languages at Shenzhen
Polytechnic Institute, Shenzhen, China. Her research interests are in the area of foreign language pedagogy and writing assessment.
References
Biola, H. R. (1982). Time limits and topic assignments for essay tests. Research in the
Teaching of English, 16, 97–98.
Caudery, T. (1990). The validity of timed essay tests in the assessment of writing skills.
ELT Journal, 44, 122–131.
Crone, C., Wright, D., & Baron, P. (1993). Performance of examinees for whom English is
their second language on the spring 1992 SAT II: Writing Test. Unpublished manuscript
prepared for ETS, Princeton, NJ.
Elder, C., Barkhuizen, G., Knoch, U., & von Randow, J. (2007). Evaluating rater
responses to an online rater training program. Language Testing, 24, 37–64.
Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback to
enhance rater training: Does it work? Language Assessment Quarterly, 2, 175–196.
Elder, C., & von Randow, J. (in press). Exploring the utility of a Web-based English
language screening tool. Language Assessment Quarterly.
Ellis, R. (Ed.). (2005). Planning and task performance in a second language. Oxford:
Oxford University Press.
Hale, G. (1992). Effects of amount of time allocated on the Test of Written English (Research
Report No. 92-27). Princeton, NJ: Educational Testing Service.
Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How
does it compare with face-to-face training? Assessing Writing, 12, 26–43.
Kroll, B. (1990). What does time buy? ESL student performance on home versus class
compositions. In B. Kroll (Ed.), Second language writing: Research insights for the classroom. Cambridge: Cambridge University Press.
Livingston, S. A. (1987, April). The effects of time limits on the quality of student-written
essays. Paper presented at the meeting of the American Educational Research
Association, Washington, D.C., United States.
McNamara, T., & Lumley, T. (1997). The effect of interlocutor and assessment mode
variables in overseas assessments of speaking skills in occupational settings.
Language Testing, 14, 140–156.
Powers, D. E., & Fowles, M. E. (1996). Effects of applying different time limits to a
proposed GRE writing test. Journal of Educational Measurement, 33, 433–452.
SPSS, Inc. (2006). SPSS (Version 15) [Computer software]. Chicago: Author.
Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press.
Younkin, W. F. (1986). Speededness as a source of test bias for non-native English
speakers on the College level Academic Skills Test. Dissertation Abstracts International,
Effect of Repetition of Exposure and Proficiency
Level in L2 Listening Tests
Second language (L2) listening test developers must take into account
a variety of factors such as the characteristics of the input, the task, and