The Distribution of Human Capital in Sweden, ch. 3

3. Data

This chapter describes the data used in the study. It aims to enable an assessment of the validity of the data and illustrates some of their characteristics. The next chapter discusses each variable’s role in the identification strategy.

3.1 Data sources

The data have been collected from the Longitudinal Individual Database (LINDA) at Statistics Sweden (SCB). Exhaustive descriptions of the database and its contents can be found on Statistics Sweden’s web page: www.scb.se/eng.

LINDA is a register-based database consisting of a large panel of individuals that is representative of the Swedish population. The panel was created in 1994 by drawing a random sample of 300 000 individuals from the Swedish population, corresponding to about 3 percent of the total population at the time. SCB then backtracked the panel to 1968, the first year for which income statistics are available. Each subsequent year’s sample has been obtained by sampling from the inflow to the population (births and immigration) to replace the outflow from the panel (deaths and emigration), so that the panel continues to correspond to about 3 percent of the population each year. Thus, the data are also cross-sectionally representative (Edin and Fredriksson, 2000).

The wage data in the panel originate from Statistics Sweden’s annual wage structure survey (Lönestrukturstatistiken). An essential characteristic of this survey is that the employers contacted by Statistics Sweden are legally obligated to supply information on the wages of their employees (Gustavsson, 2004). Hence the validity of the data is exceptionally high. Unfortunately, the coverage of the wage information is inconsistent. Between 1992 and 1997 only about half of the individuals employed in the private sector are included in the wage structure survey, whereas all individuals employed in the public sector are included. As a result, wage information is more likely to be missing in LINDA for those who were employed in the private sector during that time. Moreover, because the private-sector sampling frame of the wage structure survey is stratified by industrial affiliation and number of employees, with large firms having a higher probability of being sampled, it is mainly the wages of employees in small firms that are underrepresented in the data between 1992 and 1997 (Gustavsson, 2004). However, this discrepancy between the sectors is corrected from 1998 onwards, at which point SCB began conducting an additional wage survey for those whose wage information in LINDA was missing.

3.2 Analysis sample

Because some variables are only available for a limited period, the analysis sample is restricted to observations between 1992 and 2005 on individuals aged 18–65 with a recorded wage. This reduced the number of observations from 4 302 353 in the original dataset for the specified time period to 1 325 431 in the analysis sample. Additionally, due to changes in nomenclature over the years, it has been necessary to recode certain variables in order to form a coherent longitudinal dataset out of the year-by-year data. This process, and the variables used in the study, are described below.
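The restrictions above can be illustrated with a minimal sketch. The record layout and the field names (person_id, year, age, wage) are hypothetical and are not LINDA's actual variable names; a missing wage is represented by None.

```python
from typing import List, Optional, TypedDict


class Record(TypedDict):
    """One person-year observation (hypothetical field names)."""
    person_id: int
    year: int
    age: int
    wage: Optional[float]  # None when no wage was recorded


def in_analysis_sample(rec: Record) -> bool:
    """Return True if the observation satisfies all three restrictions."""
    return (
        1992 <= rec["year"] <= 2005   # years covered by the analysis sample
        and 18 <= rec["age"] <= 65    # age restriction
        and rec["wage"] is not None   # a wage must be observed
    )


# Made-up rows, for illustration only.
rows: List[Record] = [
    {"person_id": 1, "year": 1995, "age": 30, "wage": 21000.0},
    {"person_id": 2, "year": 1991, "age": 40, "wage": 25000.0},  # year too early
    {"person_id": 3, "year": 2000, "age": 17, "wage": 12000.0},  # too young
    {"person_id": 4, "year": 2003, "age": 50, "wage": None},     # no wage
    {"person_id": 5, "year": 2005, "age": 65, "wage": 30000.0},
]

sample = [r for r in rows if in_analysis_sample(r)]
kept_ids = [r["person_id"] for r in sample]
```

Only observations 1 and 5 pass all three conditions in this made-up example.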

3.3 Variables

The dependent variable is a standardized measure of monthly wages, converted to full-time employment equivalents as follows:

(3.1) wage = w0 + (w1 + w2 + w3)/λ

(3.2) wage = w0 + [(w1 + w2 + w3)/λ] ∗ 4.35

(3.3) wage = w0 + [(w1 + w2 + w3)/h] ∗ 165

where equation (3.1) applies to monthly wages, equation (3.2) to weekly wages and equation (3.3) to hourly wages. Here h is the number of hours worked and λ is the share of full-time employment (full-time implies λ = 1). w0 is call allowance and other cash compensation, w1 is fixed salary and performance allowance, w2 is performance salary and bonuses, and w3 is shift and unsocial-hours allowances. Together, w0, w1, w2 and w3 constitute all taxable wage components in Sweden. The standardized monthly wage is then transformed by taking the natural logarithm: ln_wage = ln(wage).
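As an illustration of the three conversions, the sketch below computes the standardized wage and its logarithm. The function name, the pay_period argument and the example amounts are hypothetical. The formulas follow equations (3.1) to (3.3) as read here, with w0 left outside the scaling in every case.

```python
import math
from typing import Optional

WEEKS_PER_MONTH = 4.35  # conversion factor in equation (3.2)
HOURS_PER_MONTH = 165   # conversion factor in equation (3.3)


def standardized_wage(w0: float, w1: float, w2: float, w3: float,
                      pay_period: str, share: float = 1.0,
                      hours: Optional[float] = None) -> float:
    """Full-time equivalent monthly wage.

    share is the share of full-time employment (lambda in the text);
    hours is the number of hours worked (h), needed for hourly wages only.
    """
    scaled = w1 + w2 + w3  # components that are rescaled to full time
    if pay_period == "monthly":   # equation (3.1)
        return w0 + scaled / share
    if pay_period == "weekly":    # equation (3.2)
        return w0 + (scaled / share) * WEEKS_PER_MONTH
    if pay_period == "hourly":    # equation (3.3)
        if hours is None:
            raise ValueError("hours is required for hourly wages")
        return w0 + (scaled / hours) * HOURS_PER_MONTH
    raise ValueError(f"unknown pay period: {pay_period}")


# A half-time employee paid monthly (made-up amounts).
wage = standardized_wage(500, 10000, 1000, 0, "monthly", share=0.5)
ln_wage = math.log(wage)  # natural logarithm, as in the text
```

In this made-up case the rescaled components amount to 11 000 at half time, so the standardized wage is 500 + 11 000/0.5 = 22 500.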

To obtain a measure of labor force experience, each individual’s potential experience prior to their first entry into the panel is calculated and added to observed experience. The components of potential labor force experience are age and years of schooling, calculated as:

Potential experience = Age_it* − Years of schooling_i − 7

where Age_it* denotes the age of individual i in year t*, the year of entry into the panel. Years of schooling_i depends on the educational level of the individual, taking the value 9, 12 or 16 for elementary school, high school and college, respectively.
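A minimal sketch of this calculation follows, assuming that potential experience equals age at entry minus years of schooling minus 7. The function names and education labels are hypothetical, and naming the constant 7 after the school starting age is an interpretation that the text does not state. The text also does not say how negative values are treated, so the sketch leaves them unchanged.

```python
# Years of schooling by educational level, as given in the text.
SCHOOLING_YEARS = {"elementary": 9, "high_school": 12, "college": 16}

# The constant 7 in the formula; read here as the school starting age
# (an assumption).
SCHOOL_START_AGE = 7


def potential_experience(age_at_entry: int, education: str) -> int:
    """Potential experience before the first entry into the panel."""
    return age_at_entry - SCHOOLING_YEARS[education] - SCHOOL_START_AGE


def total_experience(age_at_entry: int, education: str,
                     observed_years: int) -> int:
    """Potential experience plus experience observed in the panel."""
    return potential_experience(age_at_entry, education) + observed_years


# A college graduate who enters the panel at age 30 and is then
# observed working for 5 years (made-up example).
example_potential = potential_experience(30, "college")
example_total = total_experience(30, "college", 5)
```

For this made-up individual, potential experience is 30 − 16 − 7 = 7 years, and total experience after five observed years is 12.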

The industrial affiliation of each individual’s main employer is contained in the variable industry, following the Swedish Standard Industrial Classification from 1992 (SNI92). This variable was originally reported at the 5-digit level but has been aggregated to obtain analytically coherent industrial segments. Furthermore, the county of residence of each individual is recorded in the variable county. However, due to consolidations of counties in 1997 and 1998, the original data are not comparable over time. This was solved by recoding the data that predate the latest consolidation, thereby mapping the regional structure obtained after the consolidations onto the entire time period. Additional variables obtained from LINDA include a dummy variable, child_t, indicating whether a child under 16 years of age lives in the household, and a dummy variable indicating full-time employment, fulltime_t.
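The county recoding can be sketched as a simple lookup. The mapping below is an assumption based on the county mergers that formed Skåne (1997) and Västra Götaland (1998); it is not taken from the study's actual recoding scheme. It uses county names rather than the codes found in LINDA, and it ignores any municipalities that changed county in another way.

```python
# Hypothetical mapping from pre-consolidation counties to the structure
# in place after 1998 (an assumption, not the study's recode table).
COUNTY_RECODE = {
    "Kristianstad": "Skåne",
    "Malmöhus": "Skåne",
    "Göteborg och Bohus": "Västra Götaland",
    "Älvsborg": "Västra Götaland",
    "Skaraborg": "Västra Götaland",
}


def harmonize_county(county: str) -> str:
    """Map a county to the post-consolidation structure.

    Counties unaffected by the consolidations are returned unchanged.
    """
    return COUNTY_RECODE.get(county, county)


# Made-up observations from before and after the consolidations.
observed = ["Malmöhus", "Stockholm", "Skaraborg", "Skåne"]
harmonized = [harmonize_county(c) for c in observed]
```

Applying the lookup to every year, rather than only to the years before the consolidations, keeps the recoding idempotent: values that already follow the new structure pass through unchanged.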

In addition to the time-varying variables mentioned above, the time-invariant variables gender and immigrant were also retrieved from LINDA.

3.4 Summary statistics

Since the extraction of the analysis sample considerably reduced the number of observations, it is important to analyze the representativeness of the attained sample in relation to the original dataset. Hence, a comparison between the original dataset and the analysis sample is presented in Table 3.1.

Table 3.1 [Table not shown]

From the top three rows of Table 3.1 we see that the analysis sample contains proportionally more immigrants, males and individuals with children living in the household. Since all persons over the age of 65 are excluded from the analysis sample, an increase in the proportion of males is expected due to the higher life expectancy of women. We would also expect the senior population and individuals less than 18 years of age to be less likely to have children living in the household. Hence the increase in the proportions of males and individuals with children is anticipated. The rather large increase in the proportion of immigrants, however, is more difficult to explain. Certainly, some of the increase is due to labor force immigration, which increases the probability that an immigrant also has a wage and thereby the probability of inclusion in the analysis sample. On the other hand, discrimination on the labor market, as well as the difficulties inherent in living in a country whose language is not one’s mother tongue, would influence the probability of inclusion in the opposite direction. Whatever the cause of this disproportionality, I do not think that it limits the generalizability of the study in any substantial way.

Our analysis sample is also slightly more educated than the original dataset, which is in line with the general perception that education increases the probability of employment. Turning our attention to the proportions of the different levels of the factor variables county and industry, it is striking how similar the analysis sample is to the original dataset. With the exception of the industry segment Health and social care, which is slightly overrepresented in the analysis sample, this suggests that the structural differences between the counties and industrial segments in Sweden, pertaining to age and the likelihood of employment, are quite small.

3.5 Transition tables

Since the identification strategy in this study uses the variation within each cross-section unit to estimate the parameters of interest, it is important that the variables in the model vary sufficiently for the parameters to be well identified. For example, choosing a base level for a factor variable becomes an issue of weighing, on the one hand, the proportion of the sample that ever exhibited the base level value and, on the other, the variability of the factor variable at that particular value. Table 3.2 presents calculations of the overall, between and within variation in the time-varying variables, to be interpreted as follows:

• The overall variation corresponds to the percentage of all observations that have the specified value of a variable.
• The between variation is the percentage of individuals for which the specified value of a variable is observed in at least one time period.
• Finally, given that the specified value of a variable is observed for an individual, the within variation is the percentage of that individual’s observations on the variable that correspond to the specified value. Hence, it is a measure of the within-individual variability of the variable.

To exemplify: a variable whose value does not change over time (such as ethnicity) will have a within variation percentage of one hundred, while a variable that never takes the same value twice, and thus always changes from one time period to the next (such as year), will have a within variation percentage of zero. In the rows with totals, the overall and between measures are simply sums, whereas the within measure corresponds to the overall variability of the variable.
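The three measures can be illustrated with a small sketch on made-up data. The function name and the data layout are hypothetical. The within measure is computed here as an unweighted average, across the individuals who ever exhibit the value, of the share of their observations that have it; this averaging is an assumption and may differ from the software used for Table 3.2.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


def variation_table(
    panel: List[Tuple[Hashable, Hashable]]
) -> Dict[Hashable, Tuple[float, float, float]]:
    """Overall, between and within percentages for each value.

    panel holds one (person_id, value) pair per person-year observation.
    """
    by_person = defaultdict(list)
    for person_id, value in panel:
        by_person[person_id].append(value)

    n_obs = len(panel)
    n_persons = len(by_person)
    table = {}
    for v in {value for _, value in panel}:
        # Overall: share of all observations with value v.
        overall = 100 * sum(1 for _, value in panel if value == v) / n_obs
        # Per-person share of observations equal to v, for those who ever have v.
        shares = [obs.count(v) / len(obs)
                  for obs in by_person.values() if v in obs]
        # Between: share of individuals who ever have value v.
        between = 100 * len(shares) / n_persons
        # Within: average per-person share among those individuals.
        within = 100 * sum(shares) / len(shares)
        table[v] = (overall, between, within)
    return table


# Made-up full-time indicator for three individuals over four years.
panel = (
    [("A", 1)] * 4                      # always full-time
    + [("B", 1)] * 2 + [("B", 0)] * 2   # switches once
    + [("C", 0)] * 4                    # never full-time
)
result = variation_table(panel)
```

For the value 1 in this made-up panel, half of all observations are full-time (overall), two of three individuals are ever full-time (between), and those two are full-time in 100 and 50 percent of their observations, giving a within percentage of 75.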

Table 3.2 [Table not shown]

Since the number of observations in the sample exceeds 1.3 million, a quick examination of the calculations in Table 3.2 suggests that, with the exception of Extra-territorial organizations in industry, there should not be any difficulty in estimating the parameters for the time-varying variables in the model.