Current research identifies two main data collection instruments for measuring the intensity with which an activity is carried out: the frequency scale, or stylized item (hereinafter simply item), and the time use diary (hereinafter simply diary). In the former case, respondents are generally asked to indicate how frequently they perform a given activity within an established period, which could be a week, a month, or, more commonly, a year. In the latter case, respondents typically compile a daily diary in which they note which activity, or activities, they carry out at established intervals, usually 10 to 15 minutes, and where. Both data collection methods have advantages and disadvantages.
Using items, it is possible to survey the distribution of the intensity of a given activity over relatively long time cycles. However, such information can be subject to bias, which for many surveys is one of the most important sources of error (Biemer, 2010, p. 823).1 Items often do not allow data to be collected accurately (Kan & Pudney, 2008) because recalling what activity took place, how often it occurred, and how long it lasted is rather difficult after some time has elapsed. Moreover, items are disproportionately prone to social desirability or demonstration effects because indicating, or omitting, that a given activity was carried out involves, at most, passive action (Gershuny, 2003, 2012). Finally, the low precision of the stylized item is exacerbated when the study addresses behaviour subject to some form of obligation or expectation, including in moral terms, as in the case of religious behaviour (Presser & Stinson, 1998).
Diaries, by contrast, can be used to collect more reliable information. Their chronological structure makes it easier to recall events and record their timing (Belli, 1998). Moreover, any bias due to memory gaps during the compilation phase is generally limited (Al Baghal et al., 2014), and it can be further reduced through containment strategies such as providing options to log activities as soon as possible using smartphones and tablets (te Braak et al., 2023) and properly training interviewers on techniques for “retrieving” memory lapses (Kirchner et al., 2018). Notably, unlike with items, falsification requires episodes to be actively invented. Consequently, as honesty is the easiest behaviour for respondents, there are fewer desirability effects (Gershuny, 2012). However, the data cover a limited reference period, and almost nothing is known about the distribution of the intensity with which the given activity is carried out in the period not covered by the diary (Scappini, 2010).
To overcome the difficulties arising from these two data collection methods, researchers have suggested finding a model that can calibrate the values obtained from items with those collected from diaries, so as to obtain data that are as complete and reliable as possible. This need has led several scholars, above all Gershuny, to attempt to combine the strengths of the two data collection tools. Kan and Gershuny (2009) showed that it is possible to calibrate items by combining two datasets: one derived from a survey that collected questionnaire and diary data from the same respondents, and the other from a questionnaire-based survey. This method was then developed using the latest matching techniques (Borra et al., 2013; Walthery & Gershuny, 2019).
The disadvantage of these approaches is their use of unusual theoretical concepts and relatively complex regression techniques. A simpler solution is the one presented in Scappini (2021), although it is not without its problems. The Linear model with which that paper concludes is attractive in that it uses the information derived from the diaries to calibrate items easily, by means of a function that develops gradually and without discontinuities; however, as will be seen, it also introduces a possible bias factor. This distortion could become relevant in certain situations, and the refinement of the model proposed here offers a solution.
To present some practical applications of the method, we use data from the Time Use Survey conducted in Italy in 2008, which includes both a questionnaire with an item on the frequency of mass attendance and a daily diary.2 It is well known that survey items on religious behaviour are particularly affected by distortion that can be both very high and very variable (Hadaway et al., 1993; Presser & Chaves, 2007; Presser & Stinson, 1998; Rossi & Scappini, 2012; Scappini, 2018). Diaries, in contrast, provide substantially correct information but do not allow us to identify, or delimit, within the community surveyed, the subgroup of regular practitioners, however defined.3 Applying the method to this kind of data shows the substantial bias that items can generate and makes it possible to overcome the limitation of the diary.
The paper is organized as follows. The following section presents the data and discusses the different characteristics of the indicators. The next one reviews the existing models and the reason why their application may produce calibrations subject to bias, then goes on to present two new models, which are the focus of the present paper. Finally, in the last section, some comparisons will be made using real data.4
Diaries and Stylized Items: Indicators With Different Characteristics
The Indicator Provided Via the Diary
The following is a description of the indicator that is derived from the use of a daily diary. After determining the total number of subjects who will have to keep a diary (N), we will build sub-samples, each composed of N/D individuals. The number of sub-samples is equal to D, which is the period (defined in days) during which the diaries are kept. Typically, the period D is equal to 364 days.5
We can now organize the data in the form of a matrix composed of N/D rows and D columns and calculate the ratio P, i.e. the proportion of positive events among possible events (N) (Rossi & Scappini, 2014). As will be seen, there will be a need to decompose the overall value P into I subgroups. In this case we will identify different values pᵢ, with i = 1, …, I. We will call the statistics P and pᵢ measured presence (hereinafter also just presence).
However, P is a “poor” indicator of information because, as we have seen, the subjects surveyed on the various days belong to different sub-samples. It is therefore not possible to select the part of the population that carries out a given activity within a specific range of intensity for periods longer than the extension of the diary.
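As a minimal sketch of the indicator just described, the following Python snippet computes the overall measured presence and its subgroup values from a list of one-day diary records. The record layout (a subgroup label plus a flag for whether the activity appears in the diary) and the function name `measured_presence` are illustrative assumptions, and survey weights are ignored.

```python
from collections import defaultdict


def measured_presence(records):
    """Return (P, {subgroup: p_i}) as percentages.

    `records` is an iterable of (subgroup, did_activity) pairs, one per
    one-day diary. This layout is a hypothetical simplification.
    """
    total = 0
    positives = 0
    by_group = defaultdict(lambda: [0, 0])  # subgroup -> [positives, diaries]
    for subgroup, did_activity in records:
        total += 1
        positives += int(did_activity)
        by_group[subgroup][0] += int(did_activity)
        by_group[subgroup][1] += 1
    overall = 100.0 * positives / total
    per_group = {g: 100.0 * pos / n for g, (pos, n) in by_group.items()}
    return overall, per_group


# Toy data (invented): 8 diaries split across two subgroups.
diaries = [(1, False), (1, False), (1, False), (1, True),
           (2, True), (2, True), (2, False), (2, True)]
P, p = measured_presence(diaries)
```

Because each diary covers a single day, the result describes shares of person-days, not how often any one respondent performs the activity over a longer period.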
The Indicator Provided Via the Stylized Item
Daily diaries are not always sufficient to provide an adequate answer, as there are activities with relatively long time cycles. In such cases, it would be necessary to select the part of the population that carries out a given activity over a longer period of time: for example, one week, one month, or longer intervals.
The typical solution to this problem is to employ a questionnaire with a suitable item that can be used to determine how frequently each subject performs a specific activity over a given period, which is usually one year. Ideally, if n is the number of days in the given period (generally n = 364), n + 1 values can be calculated, each of which provides the number of people attending mass t times, with t ranging from 0 to 364. Dividing each of these values by N, the size of the sample, gives the daily attendance rate for each single value of t. This can also take the form of a cumulative rate indicating the proportion of people who perform an activity at least t times per year.
Having defined the indicators derived from the use of the two survey instruments, we now need to make these measures comparable.
The Conversion From Frequency to Presence
As is known, although the presence values provided by diaries cannot be converted into the frequencies provided by items, the reverse process is possible. Using an approach similar to that of other authors (Presser & Chaves, 2007), we perform this conversion by multiplying the number of people reporting each frequency by that frequency and summing the results, which gives the number of positive events, and then dividing by the number of possible events.
However, it is unrealistic to ask respondents for such precise information about their attendance at mass over the course of a year. In general, as in this case, it is preferable to offer a limited number I of answer options, each option i corresponding to a typical frequency. Expressing the typical frequency of option i as a share of the n days in the period gives the value sᵢ, with i = 1, …, I, and the overall value is obtained by averaging the sᵢ over the respondents. We will call the overall statistic and the sᵢ stylized presence. To make this conversion, we must tackle an additional problem: identifying the typical frequency associated with each option.
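Written out, the conversion from frequency to presence described above could take the following form. The symbols used here are notational assumptions introduced for illustration only: N_t is the number of respondents reporting t occurrences, f_i the typical frequency assigned to option i, N_i the number of respondents choosing option i, and S the overall stylized presence, with all presence values expressed as percentages.

```latex
\[
  \text{presence} \;=\; \frac{100}{N\,n}\sum_{t=0}^{n} t\,N_t ,
  \qquad
  s_i \;=\; 100\,\frac{f_i}{n}, \quad i = 1,\dots,I ,
  \qquad
  S \;=\; \sum_{i=1}^{I} \frac{N_i}{N}\, s_i .
\]
```

The first expression applies when exact frequencies are available; the second and third are its counterpart when only I answer options are offered.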
Before carrying out this task, we need to present the data.
The Data
The discussion in the next two sections draws on two datasets. The first consists of simulated data, which will be used to present the different calibration models. The second is the Time Use Survey conducted in Italy in 2008 (henceforth TUS, 2008), which will be used to present an empirical example of the benefits that can be achieved with calibration.
The Simulated Data
The first dataset consists of data constructed specifically to highlight the differences in the application of the four models; it will not be used to present a real application.6 Two criteria guided its construction. The first was to make the outcome of the calibration models easier to distinguish graphically, a result not achievable with the real data. The second was to highlight the situations in which bias may emerge from the application of the models discussed below.
The TUS 2008 Data
The second dataset is made up of TUS 2008 data, part of the more general ISTAT Multipurpose Survey System, which was generally conducted every five years. The survey used here was carried out between February 2008 and January 2009 (TUS, 2008). The respondents kept a diary over a 1-day period, recording what they were doing (every 10 minutes) and where they were. In addition, they answered a detailed stylized questionnaire.
The sample consisted of 18,240 families, with a response rate of 73.96% (American Association for Public Opinion Research response rate RR1). Non-response at the diary level further reduced the sample: of the 43,460 eligible diaries, which include only subjects aged 3 years or older, 40,944 were collected, broken down as follows: 14,787 relate to a weekday (Monday–Friday), 13,286 relate to a Saturday, and 12,871 relate to a Sunday. For the purposes of this study, we single out respondents aged between 18 and 74, giving a final sample of 30,673 people. To minimize potential bias, the analysis was weighted by day of the week, gender, age, level of urbanization and multi-regional area.
As highlighted earlier, to make a comparison the same activity (here, religious practice) needs to be surveyed using both a diary and a suitable item.
For the diary, we used the codes regarding religious practice in places of religious worship.7 The minimum period of time considered in diaries is 10 minutes and is associated with the main activity carried out in that timeframe. A subject was counted as “present at mass” if their compiled daily diary contained at least two minimum-length episodes, thus corresponding to attendance for a time equal to or greater than 15 minutes.
Regarding the stylized item, the question used is: “How often do you usually go to church or another place of worship?”. For the available options, the typical annual frequencies and the corresponding values of sᵢ are set as follows: 0 times for the option “never”, with s₁ = 0.00; 6 times for “a few times a year”, with s₂ = 6/364⋅100 = 1.65; 24 times a year for “a few times a month (but less than four times)”, with s₃ = 6.59; 52 times a year for “once a week”, with s₄ = 14.29; 182 times a year for “a few times a week”, with s₅ = 50.00; and, finally, 364 times a year for “every day”, with s₆ = 100.00.8
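The arithmetic behind these values can be checked with a short Python sketch. The typical frequencies are those stated in the text; the dictionary and the function name `stylized_presence` are illustrative choices.

```python
DAYS_PER_YEAR = 364  # n, the reference period used in the text

# Typical annual frequency assigned to each response option (from the text).
typical_frequency = {
    "never": 0,
    "a few times a year": 6,
    "a few times a month (but less than four times)": 24,
    "once a week": 52,
    "a few times a week": 182,
    "every day": 364,
}


def stylized_presence(times_per_year, days=DAYS_PER_YEAR):
    """Convert an annual frequency into a percentage of days."""
    return 100.0 * times_per_year / days


# Stylized presence for each option, rounded to two decimals as in the text.
s = {option: round(stylized_presence(f), 2)
     for option, f in typical_frequency.items()}
```

Changing the typical frequency assigned to an option (for example, 12 instead of 6 for “a few times a year”) changes the corresponding value proportionally, which is why the choice of these frequencies matters for the conversion.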
The Calibration Models
Let us now introduce the four models of calibration pertinent to our discussion and the problems associated with their application. As previously mentioned, in this section we will only use simulated data.
The Uniform Model
We will now consider an item administered to a sample of N individuals, where the possible response options I correspond to data values or frequency ranges. Let us now identify the sub-sample that selected the response option , from which we will derive probability that will be equal to , with If we now set with , and , then , where is the fraction of the population that carries out a given activity with a measured presence in the sub-sample equal to . We note that the assumption is important because it guarantees that there is a reasonable link between declared behaviour — as expressed in the item — and measured behaviour — as noted in the diary (Scappini, 2021).
If, instead of a categorical variable, we assume that X is a continuous variable, with area and , then , where is now the population density with measured presence equal to with . We will now calculate the coordinates of (
Starting from the abscissa values (), the corresponding ordinate values will be . For the two tails, if , we have with , while if , we have with . Given these assumptions, we can now build the calibration function. The uniform , hereinafter called , will be defined as follows:
Then the relative uniform hereinafter , is equal to:
defined as .
In this way, we obtain an initial result that is much less subject to bias than one derived solely from an item. However, the assumption of a uniform distribution is extremely improbable in practice: it is unlikely that the PDF would feature breaks at the transition from one option to the next. A more reasonable assumption is that the development is more progressive. In the next section, we will describe a solution to this problem.
The Linear Model
If we use the values defined with the uniform distribution and set and , we can now develop a model that better responds to the above mentioned criteria of progression: the linear , hereinafter called , will be defined as follows:
The relative linear hereinafter , will be equal to:
defined as with .9
This formulation has the advantage of a more gradual development of the density and, therefore, of the cumulative values, as well as producing a continuous function. Let us now analyse why the application of the two models presented may be subject to bias.
The Bias in the Models
We are now going to introduce the information entered in Figure 1. Let’s start from , which is a probability density function, whose trend is determined by the values of and and is marked as on the graph. The values indicated with together with the line marked as Area () are useful to delimit the relevant reference areas. The continuous line shows the trend of the calibrated values , while the associated symbols on the same line, and , correspond to the specific calibrated values. In the first case, the calibration will be calculated for the values of , and therefore with reference to the measured presence, values which, we note, are usually helpful only to aid the reading of the graph, while in the second case the calibration will be calculated for the values of , and therefore with reference to the stylized presence.
Figure 1
Calibrated Uniform Model, and
Before continuing, we would like to point out that it is possible to calibrate , while the comparisons between andare feasible only for
From the comparison between Figure 1 and Figure 2, the improvement in terms of smoothness of linear vs uniform calibration is evident. The problem now arising is that neither of these formulations — and — guarantees that:
Figure 2
Calibrated Linear Model, and
In general, with the uniform calibration this doesn’t happen, as normally . In addition, the assumption of a gradual development of the function may generate a further asymmetry in the distribution of the probabilities . Therefore, even though , in general it will still be the case that .
Since it is not possible to generalize the attractive assumption discussed above, we propose as an alternative to take only the part that contributes in terms of CDF for a value related to the one given by , basically ignoring what happens after the value of . Therefore, as the average value divides the area of the part in two so that:
it is possible to consider equally attractive the following occurrence of equality:
1
Figure 1 shows an example of an ideally non-problematic situation, in which Equation (1) is verified, for with .10 However, if Equation (1) is not verified, then it is possible to regard this breach as a bias factor due to the calibration model. The Figure also shows the example of this situation for with .
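The kind of mismatch at issue can be illustrated with a deliberately simplified Python sketch. It assumes, hypothetically, that each option is assigned a bin whose edges are the midpoints between adjacent measured presences, with the tails at 0 and 100, and that the density is uniform within each bin. These bin edges and the presence values are invented for illustration and are not necessarily the construction used by the models above. Under a uniform density, the point that splits a bin's area in half is the bin midpoint, so the check reduces to comparing each midpoint with the corresponding measured presence.

```python
def midpoint_bins(p):
    """Hypothetical bin edges: 0, midpoints between adjacent p values, 100."""
    inner = [(a + b) / 2 for a, b in zip(p, p[1:])]
    return [0.0] + inner + [100.0]


def uniform_bin_centres(edges):
    """Under a uniform density, each bin's mean and median is its midpoint."""
    return [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]


p = [2.0, 5.0, 20.0, 60.0]            # invented measured presences (%)
edges = midpoint_bins(p)              # bin edges under the stated assumption
centres = uniform_bin_centres(edges)  # value that halves each bin's area
gaps = [c - pi for c, pi in zip(centres, p)]  # non-zero gap = mismatch
```

In this toy case every gap is non-zero, and the gaps grow where the measured presences are unevenly spaced, which is the situation in which a uniform-within-bin assumption is most likely to distort the calibrated values.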
To solve all the above problems, we need to look at an alternative model which we are going to illustrate.
The New Uniform Model
If we assume that X is a continuous variable, with and with an area , we can now calculate the respective coordinates of ( defined and ( defined . Starting from the abscissa values with , the corresponding ordinate will be equal to:
and
Regarding the two tails, we have if , with , and if , with . We can now develop the modified calibration.
The new uniform , hereinafter , is then equal to:
If we now set , then the relative , hereinafter , is equal to:
defined as .
We note that there are two special cases, where , and . In both these situations calibration for values of and is not possible. Then, the values of will be calculated assuming that is discrete and we will put respectively in the first case, and in the second.
While it is true that this model is not subject to bias since by definition , the function is discontinuous (see Figure 3), and the resulting values are less smoothed out compared to those shown in Figure 2.
Figure 3
Calibrated New Uniform Model, and
To sum up, we have now achieved a first result: a calibration model that is not subject to bias. However, it is also true that the assumption of uniform distribution is, in practice, very unlikely. Similarly to what we have already pointed out to justify the transition from to , it can be considered unrealistic to have “breaks” between adjacent values of in the trend of the PDF, while it would seem more reasonable to assume that the trend from to is more progressive. We will address the issue in the next section.
The New Linear Model
As in the case of the Linear model, if we use the values defined with the new uniform distribution and we place and , while leaving the two tails unchanged, a model can be developed that better meets the progressivity criteria now mentioned.
The new linear , hereinafter , is then equal to:
If we now set , then the relative , henceforth , will be equal to:
defined as .
Similar to the previous method, we observe that in the two special cases, those in which , and , the values of will be calculated without calibration: in the first case we will assume that , while in the second that .
If we take a look at Figure 4, we find we have a more attractive calibration model than the previous ones. While this is not, in general, a continuous model, like the Linear — — it is nevertheless a correct model and more smoothed out than the new Uniform — 11
Figure 4
Calibrated new Linear Model, and
We will now present the results of applying the calibration models to the TUS 2008 data.
Empirical Study
It has been shown that the Uniform and Linear models may be subject to bias because they do not guarantee the equality discussed above. It was then shown that the Uniform and new Uniform models rest on assumptions that are probably unrealistic, since “breaks” between adjacent values in the trend of the PDF are implausible. It follows that the most interesting models are those that assume a more progressive trend, namely the Linear and new Linear models. However, the former, as has been shown, can be affected by bias, while the latter does not exhibit this problem. Consequently, in the comparisons that follow we will use only the most advanced calibration models, the Linear and the new Linear, taking the latter as the correct one.12
Let us now go on to apply the calibration to a real survey. The data and item related to the example we are going to propose, namely religious practice in Italy in 2008, lend themselves well to highlighting the important aspects we have drawn attention to.
We will carry out the discussion in two parts. First, we will reconfirm what is already known about the substantial overestimation of the retro-cumulative values calculated using the stylized items alone compared with the calibrated values. Second, we will compare a series of calibrated CDF values from the two models. As will be seen, beyond the formal aspects discussed, in practical use, or at least in the example presented here, the values obtained do not differ greatly. Only in one of the situations examined, which could nevertheless recur, did we detect a level of bias that can be considered relevant.
Regarding overestimation, an aspect that typically characterizes surveys on religious practice, we point out that the bias “produced” by stylized items is considerable. To give an example (see Table 1 and Figure 5), if we consider those who say they go to Mass once a week (Option 4, s₄ = 14.29), compared with a stylized value of 30.2%, the calibrated values are 8.9% with the Linear model and 8.4% with the new Linear model. These are very large differences in both absolute (> 20 percentage points) and relative (EI > 250%) terms.13 The situation is not much better for the other options, where the EI of the two models ranges from a minimum of 33/44% in Option 2 to a maximum of 680/995% in Option 5.14
Table 1
Calibrated TUS 2008 Data, Stylized , Linear Model and New Linear Model
| Options (i) | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| Measured Presence (pᵢ) % | 0.24 | 0.88 | 3.39 | 10.74 | 17.32 | 51.49 | 5.124 |
| Stylized Presence (sᵢ) % | 0.00 | 1.65 | 6.59 | 14.29 | 50.00 | 100.00 | 9.360 |
| Sample % | 14.9 | 33.8 | 21.2 | 23.4 | 5.6 | 1.1 | 100.0 |
| N | 4,565 | 10,372 | 6,488 | 7,172 | 1,728 | 348 | 30,673 |
| Retro-cumulative population % | | | | | | | |
| Stylized | 100.0 | 85.1 | 51.3 | 30.2 | 6.8 | 1.1 | |
| Calibrated (Linear model) | 100.0 | 63.8 | 32.8 | 8.9 | 0.87 | 0.0 | |
| Calibrated (New Linear model) | 100.0 | 58.5 | 31.3 | 8.4 | 0.62 | 0.0 | |
Note. Stylized question: “How often do you usually go to church or another place of worship?”; frequency options: 1. Never, 2. A few times a year, 3. A few times a month (but fewer than four times), 4. Once a week, 5. A few times a week, 6. Every day.
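The stylized retro-cumulative row of Table 1 can be reproduced from the N row with a few lines of Python. This is a sketch on the published counts only: the function names are illustrative, and the relative-excess formula is an assumed reading of EI (percentage excess of the stylized value over the calibrated one), not a definition taken from the text. Because it works from rounded table values, its results may differ slightly from the figures quoted in the text.

```python
counts = [4565, 10372, 6488, 7172, 1728, 348]  # N per option, Table 1


def retro_cumulative(counts):
    """Share (%) of respondents choosing option i or any higher option."""
    total = sum(counts)
    shares, running = [], 0
    for c in reversed(counts):
        running += c
        shares.append(100.0 * running / total)
    return shares[::-1]


stylized = [round(v, 1) for v in retro_cumulative(counts)]


def relative_excess(stylized_value, calibrated_value):
    """Assumed EI: percentage excess of stylized over calibrated."""
    return 100.0 * (stylized_value - calibrated_value) / calibrated_value


# Option 4 (once a week): stylized 30.2% against the two calibrated values.
ei_linear = relative_excess(30.2, 8.9)
ei_new_linear = relative_excess(30.2, 8.4)
```

The retro-cumulative share for Option 1 is always 100%, since every respondent attends at least zero times, so the informative comparisons start from Option 2.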
Figure 5
Calibrated TUS 2008 Data: Stylized , Linear Model and New Linear Model Retro-Cumulative Function
To better highlight the overall size of the errors, a measure of fit between the stylized distribution and the two calibrated distributions can be used. This measure, defined as the weighted Adjustment Indicator (wAI), is derived from the weighted Mean Absolute Error (wMAE).15 Intuitively, the wAI indicates the degree of similarity between the calibrated distributions and the stylized one: higher values suggest closer alignment and a reduced effect, or usefulness, of applying the model. The wAI values obtained are 84% for the Linear model and 81% for the new Linear model, which suggest a relatively large distance between the distributions.
It should be noted that the comparison between uncalibrated and calibrated values is also relevant for theoretical discussion. Using the items, one would conclude that religious practice was a relevant phenomenon in Italy in 2008, as regular practitioners make up an important fraction of the population (i.e., 30.2%); using the calibrated values, by contrast, one would infer that religion was a relatively minor phenomenon (i.e., 8.9/8.4%).16
Let us now turn to the comparison between the two calibration models studied. In this case the differences detected in their application are relatively small.17 Only in Option 2 (s₂ = 1.65) do we have a discrepancy that can be relevant, with the Linear model overestimating the new Linear model by 5.3 percentage points (EC = 9%). In the other options, the deviations are less important, with errors of less than two percentage points and EC < 6%. Only in relative terms does Option 5 (s₅ = 50.00) show considerable overestimation (EC = 40%), but the values involved are very small, so the absolute difference is quite negligible: 0.25 percentage points (0.87% against 0.62%).
The comparisons now presented show that the differences are generally not relevant and thus almost negligible in the theoretical discussion. The new Linear model, however, remains preferable not only because it is unbiased, but also because, in given situations, it allows us to better delimit the size of specific subgroups, such as those who practice relatively intensively or those who participate very rarely or never.18
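The option-by-option comparison can be reproduced from the calibrated rows of Table 1 with the sketch below. The relative measure is computed as the percentage excess of the Linear value over the new Linear value, which is an assumed reading of EC rather than a definition taken from the text; the function name `compare_models` is illustrative.

```python
# Calibrated retro-cumulative values (%) from Table 1, Options 1 to 6.
linear = [100.0, 63.8, 32.8, 8.9, 0.87, 0.0]
new_linear = [100.0, 58.5, 31.3, 8.4, 0.62, 0.0]


def compare_models(linear, new_linear):
    """Return (option, absolute gap in points, relative gap in %) per option.

    The relative gap is None where the new Linear value is zero.
    """
    rows = []
    for option, (a, b) in enumerate(zip(linear, new_linear), start=1):
        gap = a - b
        relative = round(100.0 * gap / b, 2) if b else None
        rows.append((option, round(gap, 2), relative))
    return rows


rows = compare_models(linear, new_linear)
```

Reporting both columns side by side makes the pattern in the text visible: Option 2 has the largest absolute gap, while Option 5 has the largest relative gap on a very small base.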
Conclusion
We now summarize the results. Two points are relevant and deserve attention. The first concerns the choice of the most appropriate model to calibrate the data; the second pertains to the prerogatives of calibration.
Of the four calibration methods, the Linear method, while attractive because of the continuity of the functions describing the trend, may have limitations related to the bias discussed above. This limitation is overcome by the new Uniform model, which is nevertheless less attractive than the Linear model because it can introduce major discontinuities in the transition from one option to another. We believe, therefore, that the last model presented, the new Linear model, is preferable because, while it generally still shows discontinuities, it has neither the disadvantages of the Linear model in terms of bias nor those of the new Uniform model in terms of smoothness of the results.
Subsequently, we applied the models to data on religious practice. It should be noted, however, that calibration has the distinctive advantage of being applicable in many other areas. We will now examine some, though by no means all, of the possible fields of application.
First, time use surveys often include a questionnaire with stylized items, alongside diaries: this is seen in studies of Mass attendance in Canada (Brenner, 2011) and work hours in Germany (Otterbach & Sousa-Poza, 2010). The model could also be used in surveys measuring transport usage. In this case, the need for diaries covering many weeks could be simplified by joint use of diaries and questionnaires (Axhausen et al., 2002).
Furthermore, this method could be extended to the psychological/medical sphere, such as studies on the consumption of alcohol (Townshend & Duka, 2002) or food (Vereecken & Maes, 2003). In this case, the two tools are often used interchangeably. Using them together could increase precision and simplify data collection in cases where the analysis needs to be extended over the long term.
In short, regarding the prerogatives of calibration, we have already discussed enough about the “advantages” of being able to perform unbiased analysis on phenomena that have long time cycles. Here we just want to point out that the application presented was used for demonstration purposes only. In other words, the models are independent of the specific field of substantive research and in fact, with the appropriate data, can be applied to a wide variety of social phenomena.
Future Work
However, this article does not fully address several important topics that require further investigation. While the applications presented demonstrate the model effectively by meeting its minimum criteria, further research would be useful. For example, a study could assess the adequacy of the overlap between the survey items and the diary-recorded activities, as some activities may not fully satisfy the assumptions of the model. Additionally, refining model fit measures and calculating confidence intervals for parameters are necessary steps. The current approach to model fit is not completely satisfactory, but no better alternative has yet been identified.
Future research will prioritize resolving these issues to significantly improve the model’s robustness and broaden its applicability.
This is an open access article distributed under the terms of the Creative Commons Attribution License (