Canada is divided into geographical units called dissemination areas (DA), which consist of 400 to 700 inhabitants and whose boundary lines mostly follow roads. We used data from 3,202 DAs located in the city of Toronto with an average population of 690 individuals and average physical size of 172,290 m2.

We combined data from three different sources to construct our tree, health and demographic variables:

The first source of tree canopy data came from the ‘Street Tree General Data,’ which is a Geographical Information System (GIS) dataset that lists the locations of over 530,000 individual trees planted on public land within the city of Toronto. This dataset comes from experts who traversed the city of Toronto and recorded tree species and diameters at breast height. Trees in public parks are not listed as the listed trees were only from public land that lines the streets. The set contains each tree’s common and botanical names, their diameters at breast height (DBH), the street addresses and the general location reference information. Figure 1 shows the green-space map of Toronto generated from these data for illustration.

The second source of tree canopy data came from the Geographical Information System (GIS) polygon data set ‘Forest and Land Cover,’ which contained detailed areal information of tree canopies in Toronto. In these data, the satellite imagery resolution was 0.6 m – from QuickBird Satellite imagery, 2007. The treed area was calculated using automated remote sensing software - Ecognition. This automated land-cover map was then monitored by staff from the University of Vermont Spatial Analysis Lab and adjusted to increase accuracy. In this dataset there is the ability to differentiate shrub cover from trees. There is, however, some susceptibility to errors when differentiating large shrubs from trees. To validate the accuracy of the QuickBird satellite imagery, it was compared with two other methods used to assess tree canopy cover: 1) Ocular estimates of canopy cover by field crews during data collection in 2008; 2) 10,000 random point samples of leaf-off and leaf-on aerial orthophotos (imagery available in required orthorecitifed format included 1999, 2005 and 2009)34. The tree canopy coverage estimates for each of the respective approaches were: QuickBird: 28%; Ocular: 24%; and Aerial Orthophotos: 26.2% respectively34. Because of the similarity in results, we can be confident in the accuracy of the QuickBird satellite results. For more information on the automated classification of leaf-on tree canopy from the 2007 satellite imagery see Appendix 4 of34. Figure 2 shows a map of tree canopy in each dissemination area as generated from the QuickBird Satellite.

Information about individuals’ health and demographics was obtained in the context of the Ontario Health Study (https://www.ontariohealthstudy.ca). This is an ongoing research study of adults (18 years and older) living in the Canadian province of Ontario aimed at investigating risk factors associated with diseases such as: cancer, diabetes, heart disease, asthma and Alzheimer’s Disease. The data were collected using two (similar) versions of a web-based questionnaire consisting of demographic and health-related questions. These questionnaires were completed by 94,427 residents living in the greater Toronto area between September, 2010 and January, 2013. For this study, we used data from a subset of 31,109 residents (31,945 respondents, out of which 827 were removed during quality control for having duplicate records and 9 were removed because of missing consent records). A record was considered a duplicate with the following data quality checks: 1) Multiple registrations of the same Last Name, First Name and Date of Birth 2) Multiple registrations of the same Last Name, First Name and Postal Code 3) Multiple registrations of the same Last Name, First Name, Date of Birth and Postal Code 4) Multiple registrations of the same email address. Additional data quality checks included several built-in checks in the online system, which included automatic skip patterns and limited ranges for free text numerical responses such that participant responses must be within reasonable limits. The final sample included individuals who resided in the 3,202 dissemination areas of the city of Toronto as individual tree data were only available for these areas. These dissemination areas are shown in Fig. 3.

Demographic Variables

For each individual, we used sex (59% female; compared to the population male/female ratio: Toronto’s population was 48.0% male and 52.0% female in 2011 according to Statistics Canada), age (Mean = 43.8, range = 18–99; as of 2011 the mean age of residents above 19 years of age for the entire population of Toronto is: 47.9 according to Statistics Canada), education (coded as: 1 = none (0.0%), 2 = elementary (1.0%), 3 = high school (15.3%), 4 = trade (3.3%), 5 = diploma (15.9%), 6 = certificate (5.9%), 7 = bachelor’s (35.3%), 8 = graduate degree (23.3%), with Mean = 6.07, range = 1–8; According to the 2011 National Household Survey in www.toronto.ca, the distribution of education for the entire city of Toronto is the following: 33% of all City residents 15 years and over have a bachelor degree or higher, 69% of City residents between the ages of 25 and 64 years have a postsecondary degree, 17% of 25–64 years old residents have graduate degrees) and annual household income (coded as: 1 = less than $10 000, 2 = $10 000 – $24 999, 3 = $25 000 – $49 999, 4 = $50 000 – $74 999, 5 = $75 000 – $99 999, 6 = $100 000 – $149 999, 7 = $150 000 – $199 999, 8 = $200 000 or more, with Mean = 4.67 which is equivalent to $90 806 annual income range = 1–8; compared to the entire city of Toronto’s population mean household income, which was: $87,038 in 2010 according to Statistics Canada), as well as diet (number of fruits and vegetable servings respondent consume every day, with Mean = 2.24, range = 0–10), as potential confounding variables. In addition, for each dissemination area we used the area median income from Statistics Canada and coded those data the same as the household income data, with mean = 4.08, range = 2–8. Population densities in a given DA were used in the multiple imputation analysis but not as a variable in the regressions or the canonical correlation analyses. The correlations between demographic variables can be found in Figure S2 of Supplementary Information.

Our studied sample had similar demographics to the entire city of Toronto, but was slightly younger (mean age = 43.8; Toronto population = 47.9), slightly more female (59%; Toronto population = 52%), slightly more educated (35.3% had bachelor’s degrees vs. 33% in the Toronto population) and slightly wealthier (mean household income = $93,399 vs. $87,038 in the entire city of Toronto).

Green-space variables

Crown area of the trees was used to calculate the density of area covered by trees separately for the trees on the streets and the trees from greenspace in private locations and parks in each DA. We estimated the crown area of the trees based on their diameter at breast height (DBH) values. We obtained formulas for estimating tree crown diameter based on DBH for 8 tree types (Maple, Locust, Spruce, Ash, Linden, Oak, Cherry and Birch) that were derived from forestry research. Forestry researchers have fit linear and non-linear models to relate crown diameter and DBH for different species of trees. These models achieved good fits as verified by their high R2 values (above 0.9)35,36. The formulas that were used to estimate crown diameters from DBH for these tree types and their references can be found in the Supplementary Equations section of the Supplementary Information. These 8 tree species covered 396,121 (83%) of the trees in our dataset. For the other 81,017 (17%) of the trees, we estimated crown diameter based on the linear regression of crown diameters on DBHs obtained from the 83% of the trees belonging to the tree types with known crown formulas. The crown areas of all the trees were then calculated using the crown diameters and assuming that the crown areas were circular in shape.

Street tree density for each dissemination area was quantified as the total area of the crowns of trees (m2) beside the streets in the dissemination area over total dissemination area size (m2) multiplied by 100 to be in percentage format. The range for this variable was found to be from 0.02% in the areas with the least street tree density to 20.5% in the areas with highest street tree density (Mean = 4.57%). Other tree density for each dissemination area was calculated by subtracting out the area covered by crowns of the trees on the streets (street tree area) from the total treed area (m2) in the dissemination area (from the satellite Tree Canopy data) and then dividing that by the area size and multiplying by 100 to be in percentage format. The range for this variable was found to be from 0.00% in the areas with almost no trees in parks (or no parks), no domestic gardens or other open areas; to 75.4% in areas with high tree density and parks (Mean = 23.5%). As mentioned above, there was limited ability to differentiate large shrub cover from tree cover in the satellite data. Therefore, the variable “other tree density” could contain some unwanted large shrub cover as well, especially for areas with very high other tree density.

Health variables

All of the health variables were constructed from the self-reported items in the Ontario Health Study (OHS). Items related to disorders were based on the question “Have you ever been diagnosed with …?” and coded with 0 = No and 1 = Yes. These consisted of physical conditions including high blood pressure, high cholesterol, high blood glucose, heart attack (MI), stroke, heart disease, migraines, chronic obstructive pulmonary disorder (COPD), liver cirrhosis, ulcerative colitis, irritable bowel disease (IBD), arthritis, asthma, cancer and diabetes (DM), as well as mental health conditions including addiction, depression and anxiety. About 66.3% of all respondents reported having at least one of the mentioned health conditions. The percentages of “Yes” responses for each of these conditions are reported in Supplementary Table S3. Additionally, body mass index (BMI) for each person was calculated from his/her self-reported height and weight. Our “Obesity” variable was constructed as 0 for BMI below 25, 0.5 for BMI between 25 and 30 (overweight, 26% of respondents) and 1 for BMI over 30 (obese, 13% of respondents). Other variables drawn from these data are general health perception (self-rated health (1 = poor, 2 = fair, 3 = good, 4 = very good, 5 = excellent, with Mean = 3.66, range = 1–5) and four more variables that were used in the multiple imputations to increase the accuracy of imputations: walking (the number of days a participant has gone for a walk of at least 10 minutes in length last week, with Mean = 5.33, range = 0–7), smoking (if participant has ever smoked 4-5 packs of cigarettes in their lifetime, 38% Yes), alcohol consumption frequency (coded as 0 = never, 1 = less than monthly, 2 = about once a month, 3 = two to three times a month, 4 = once a week, 5 = two to three times a week, 6 = four to five times a week, with Mean = 3.60, range = 0–7) and alcohol binge frequency (coded as 0 = never, 1 = 1 to 5 times a year, 2 = 6 to 11 times a year, 3 = about once a month, 4 = 2 to 3 times a month, 5 = once a week, 6 = 2 to 3 times a week, 7 = 4 to 5 times a week, 8 = 6 to 7 times a week, with Mean = 1.62, range = 0–8).

The dependent variables related to physical and mental health were created from the multiple-imputed data. For each complete dataset, the Cardio-metabolic Conditions index was constructed by summing the following seven variables related to cardio-metabolic health: High Blood Glucose, Diabetes, Hypertension, High Cholesterol, Myocardial infarction (heart attack), Heart disease, Stroke and “Obesity” with Mean = 0.89, range = 0–8. The Mental disorders index was constructed by summing Major Depression, Anxiety and Addiction, with Mean = 0.26, range = 0–3. The Health Perception index was the third dependent variable in our analyses with Mean = 3.66, range = 1–5. The Other disorders index consisted of Cancer, Migraines, Asthma and Arthritis (Mean = 0.48, range = 0–4. This index was constructed to be a control variable in the canonical correlation analysis. The additional variables (e.g., cirrhosis) were included to increase the accuracy of the imputation, but were not analyzed. The correlation matrix between the health variables, the tree variables and the demographic variables is reported in supplementary Figure S2 of the Supplementary Information.

Multiple imputations analysis

The self-reported health data contained some missing values for different variables (mainly due to “I don’t know” responses). List wise deletion of the data (keeping only participants with no missing values in any of the items) would have resulted in a loss of 73% of the participants because the missing values in the different items were distributed across subjects and was therefore an unreasonable method of analysis. To handle the missing data problem, we assumed that the data were missing at random (MAR), meaning that the probability of missingness for a variable was not dependent on the variable’s value after controlling for other observed variables. We then replaced the missing values with multiple imputed data37,38,39. Thirty complete datasets were created from the original dataset using the estimate and maximize (EM) algorithm on bootstrapped data implemented by the Amelia package for R [Amelia40;]. All of the 30 imputations converged in less than 11 iterations. Variables used in the imputations and their missing percentages are reported in Supplementary Table S4.

Regression analysis

The regression analyses were performed separately for each imputed dataset and then combined based on Rubin’s rules38 using the Zelig program in R41. Rubin suggested that the mean of each regression coefficient across all imputed datasets be used as the regression coefficients for the analysis. In addition, to avoid underestimation of standard errors and taking the uncertainty of the imputed values into account, both the within imputation variance and between imputation variance of each coefficient should be used to construct the standard error for each regression coefficient. Lastly42, proposed using degrees of freedom estimated as a function of the within and between imputation variance and the number of multiple imputations when approximating the t-statistics for each parameter.

To assess the amount of the variance in the dependent variables that is explained by the regression model for the multiple imputed data we used the method suggested by Harel (2009) to estimate the R2 and the adjusted R2 values. Based on this method, instead of averaging R2 values from the 30 imputations, first the square root of the R2 value (r) in each of the imputed datasets is transformed to a z-score using Fisher’s r to z transformation, z = atanh(r). The average z across the imputations can then be calculated. Finally, the mean of the z values is transformed back into an R2. The same procedure can be used for adjusted R2 values. Harel (2009) suggests that the number of imputations and the sample size be large when using this method, which holds true in the current study. Also, the resulting estimates of R2 could be inflated (i.e. are too large), while estimates of adjusted R2 tend to be biased downwards (i.e. are too small). Therefore, we estimated both values for a better evaluation of the explained variance.

Canonical correlation analysis

To investigate further the relationship between the two sets of variables, namely the health-related variables (Health Perception, Cardio-metabolic conditions, Mental Disorders and Other Disorders) and the demographic and green-space variables (Age, Sex, Education, Income, Area income, Diet, Street Tree Density and Other Tree Density), we performed a canonical correlation analysis43,44. Our model is presented in the diagram shown in Fig. 4. Mauchly’s test of sphericity was performed on the average of imputations in MATLAB (Sphertest: Sphericity tests menu) and showed that the correlation matrix of the data is significantly different from the identity matrix (p < 0.0001). This significant departure of the data from sphericity warrants the canonical correlation analysis.

In a canonical correlation analysis, first, the weights that maximize the correlation of the two weighted sums (linear composites) of each set of variables (called canonical roots) are calculated. Then the first root is extracted and the weights that produce the second largest correlation between sum scores is calculated, subject to the constraint that the next set of sum scores is orthogonal to the previous one. Each successive root will explain a unique additional proportion of variability in the two sets of variables. There can be as many canonical roots as the minimum number of variables in the two sets, which is four in this analysis. Therefore, we obtain four sets of canonical weights for each set of variables and each of these four canonical roots have a canonical correlation coefficient which is the square root of the explained variability between the two weighted sums (canonical roots).

To obtain unbiased canonical weights for variables and canonical correlation coefficients, we averaged data values over the 30 imputations and performed canonical correlation analysis on the z-scores of the averaged data using MATLAB (MATLAB and Statistics Toolbox Release 2014a, The MathWorks, Inc., Natick, Massachusetts, United States). For a more straight-forward interpretation and better characterization of the underlying latent variable, instead of using the canonical weights, we calculated the Pearson correlation coefficient (canonical loading) of each observed variable in the set with the weighted sum scores for each of the four linear composites. This way, each canonical root (linear composite) could be interpreted as an underlying latent variable whose degree of relationship with each of the observed variables in the set (how much the observed variable contributes to the canonical variate) is represented by the loading of the observed variable and its errorbar (see canonical correlation results).

To estimate the standard errors of the canonical loadings, we bootstrapped z-scores from each of the 30 complete imputed data (1000 simulations for each) and performed canonical correlation analysis 30000 times using MATLAB. Then, we calculated the variances of the set of loadings, which were calculated as explained above, over each completed dataset (within imputation variance). We also calculated the variance of the 30 sets of coefficients (between imputation variance). The standard errors of the coefficients were then estimated using the same Rubin’s rules as was done for the regression analyses.