--- title: "Auscensus - Getting Data" resource_files: - vignetttes/house-preference-flow*.png author: "Carlos YANEZ SANTIBANEZ" date: "2023-01-14" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Auscensus - Getting Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- Once the data has been imported, the package is ready to pull data from the data packs. ## Figuring out what to retrieve The first thing to figure out what data points are going to be retrieved. In order to answer that, the following questions need to be answered: - Which statistics do I want to pull, i.e. which data point from which table? - Which geographical structure do I want (e.g. SAs, LGAs, electorates)? Do i want data for all of them or just a handful? - For which years do I want the data? To answer those questions, {auscensus} comes with a family of functions, all starting with *list_*. To figure out where is the data point to retrieve, the first step to retrieve a list of available tables, which will return their code plus the year when they have been published. this can be achieved with *list_census_tables()* ```r list_census_tables() #> # A tibble: 165 × 7 #> Number `Table Name` `Table Population` `2021` `2016` `2011` `2006` #> #> 1 01 Selected Person Characteristics by Sex Persons TRUE TRUE TRUE NA #> 2 01 Selected Person Characteristics by Sex (First Release Processing) Persons NA NA NA TRUE #> 3 02 Selected Medians and Averages NA TRUE TRUE TRUE TRUE #> 4 03 Place of Usual Residence by Place of Enumeration on Census Night by Age Persons (excludes overseas visitors) TRUE NA NA NA #> 5 03 Place of Usual Residence on Census Night by Age Persons NA TRUE TRUE TRUE #> 6 04 Age by Sex Persons TRUE TRUE TRUE TRUE #> 7 05 Registered Marital Status by Age by Sex Persons aged 15 years and over TRUE TRUE TRUE TRUE #> 8 06 Social Marital Status by Age by Sex Persons aged 15 years and over TRUE TRUE TRUE TRUE #> 9 07 Indigenous Status by Age by Sex Persons TRUE TRUE TRUE TRUE #> 10 08 Ancestry by Country of Birth of Parents Responses and persons TRUE TRUE NA TRUE #> # … with 155 more rows ``` This function comes with two in-built filters. The first one allows filtering by table number, whereas the second uses a regular expression to filter the table names ```r list_census_tables("20") #> # A tibble: 4 × 7 #> Number `Table Name` `Table Population` `2021` `2016` `2011` `2006` #> #> 1 20 Count of Selected Long-Term Health Conditions by Age by Sex Persons TRUE NA NA NA #> 2 20 Unpaid Domestic Work: Number of House by Age by Sex Persons aged 15 years and over NA TRUE NA NA #> 3 20 Unpaid Domestic Work: Number of Hours by Age by Sex Persons aged 15 years and over NA NA TRUE NA #> 4 20 Unpaid Assistance to a Person with a Disability by Age by Sex Persons aged 15 years and over NA NA NA TRUE ``` ```r list_census_tables(table_name_regex = "[Cc]ountry") #> # A tibble: 6 × 7 #> Number `Table Name` `Table Population` `2021` `2016` `2006` `2011` #> #> 1 08 Ancestry by Country of Birth of Parents Responses and persons TRUE TRUE TRUE NA #> 2 09 Country of Birth of Person by Age by Sex Persons TRUE TRUE NA NA #> 3 09 Country of Birth of Person by Sex Persons NA NA TRUE TRUE #> 4 10 Country of Birth of Person by Year of Arrival in Australia Persons born overseas TRUE TRUE TRUE TRUE #> 5 26 Family Composition and Country of Birth of Parents by Age of Dependent Children Dependent children NA TRUE NA NA #> 6 30 Family Composition and Country of Birth of Parents by Age of Dependent Children Dependent children TRUE NA NA NA ``` Please note that even though the table names have changed from year to year, all tables with the same number contain the same data. Once the table has been selected, *list_census_attributes()* can be used to select a data point. Since tables can contain many attributes, thus this function also allows filtering the results by a regular expression. For instance, if we pick table *04* (Age by Sex) and want to find the data points for 60-year-olds, plus the total, we can use the below command. ```r list_census_attributes(number="04",attribute_regex = "60|[Tt]otal") #> # A tibble: 15 × 6 #> Table Attribute `2021` `2016` `2011` `2006` #> #> 1 04 Age_years_60_males TRUE TRUE TRUE NA #> 2 04 Age_years_60_females TRUE TRUE TRUE NA #> 3 04 Age_years_60_persons TRUE TRUE TRUE NA #> 4 04 Age_years_60_64_years_males TRUE TRUE TRUE NA #> 5 04 Age_years_60_64_years_females TRUE TRUE TRUE NA #> 6 04 Age_years_60_64_years_persons TRUE TRUE TRUE NA #> 7 04 Total_males TRUE TRUE TRUE TRUE #> 8 04 Total_females TRUE TRUE TRUE TRUE #> 9 04 Total_persons TRUE TRUE TRUE TRUE #> 10 04 Age (Years): 60_males NA NA NA TRUE #> 11 04 Age (Years): 60_females NA NA NA TRUE #> 12 04 Age (Years): 60_persons NA NA NA TRUE #> 13 04 Age (Years): 60-64 Years_males NA NA NA TRUE #> 14 04 Age (Years): 60-64 Years_females NA NA NA TRUE #> 15 04 Age (Years): 60-64 Years_persons NA NA NA TRUE ``` To get which geographical structures are available, *list_census_geo_types()* can help. Once selected an appropriate value, *list_census_geo()* can be used to search for particular divisions. In the example below, we are confirming if data is available for three inner-Melbourne Local Government areas. ```r list_census_geo_types() #> # A tibble: 31 × 5 #> ASGS_Structure `2021` `2016` `2011` `2006` #> #> 1 AUS TRUE TRUE TRUE TRUE #> 2 CD NA NA NA TRUE #> 3 CED TRUE TRUE TRUE TRUE #> 4 GCCSA TRUE TRUE TRUE NA #> 5 IARE TRUE TRUE TRUE TRUE #> 6 ILOC TRUE TRUE TRUE TRUE #> 7 IREG TRUE TRUE TRUE TRUE #> 8 JWSLA NA NA NA TRUE #> 9 JWSLA_extended NA NA NA TRUE #> 10 LGA TRUE TRUE TRUE TRUE #> # … with 21 more rows ``` ```r list_census_geo("LGA",geo_name_regex = "[Mm]elbourne|[Ss]tonnington|Yarra") #> # A tibble: 5 × 6 #> ASGS_Structure Census_Name `2021` `2016` `2011` `2006` #> #> 1 LGA Melbourne TRUE TRUE TRUE TRUE #> 2 LGA Stonnington TRUE TRUE TRUE TRUE #> 3 LGA Yarra TRUE TRUE TRUE TRUE #> 4 LGA Yarra Ranges TRUE TRUE TRUE TRUE #> 5 LGA Yarrabah TRUE TRUE TRUE TRUE ``` Please note that the above functions only show data for the *available* data packs - if a file has not been imported, its associated year won't be presented in the results. ## Retrieving Census Data Once the questions have been resolved, it is possible to use {auscensus} to retrieve the data with *get_census_data()*. This function will present the result for each {table, year, geography} combination as a different element within a list. By default the function does not return data but a parquet binding function - this can be changes by setting *collect_data = TRUE*. For example: ```r data <- get_census_data(census_table = list_census_tables("04"), geo_structure = "LGA", collect_data = TRUE) #> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/LGA_Vic_B04_A.csv': resource busy or locked #> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2011Census_B04A_AUST_LGA_short.csv': resource busy or locked #> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2016Census_G04A_AUS_LGA.csv': resource busy or locked #> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2021Census_G04A_AUST_LGA.csv': resource busy or locked names(data) #> [1] "2006_LGA_B04" "2011_LGA_B04" "2016_LGA_G04" "2021_LGA_G04" ``` ```r data$`2006_LGA_B04` #> # A tibble: 685 × 309 #> Year Census_Code Unit Age (…¹ Age (…² Age (…³ Age (…⁴ Age (…⁵ Age (…⁶ Age (…⁷ Age (…⁸ Age (…⁹ Age (…˟ Age (…˟ Age (…˟ Age (…˟ Age (…˟ Age (…˟ Age (…˟ #> #> 1 2006 LGA89399 Unin… 9798 10282 2121 20080 2214 10436 4335 10924 21360 16 7 23 2021 2068 4089 2094 #> 2 2006 LGA89499 No U… 21 26 4 47 6 18 10 8 26 0 0 0 3 0 3 3 #> 3 2006 LGA10050 Albu… 1490 1524 305 3014 337 1675 642 1747 3422 8 4 12 305 326 631 357 #> 4 2006 LGA10110 Armi… 660 704 144 1364 137 910 281 899 1809 0 0 0 163 157 320 173 #> 5 2006 LGA10150 Ashf… 1048 1162 246 2210 265 943 511 967 1910 14 0 14 199 201 400 171 #> 6 2006 LGA10200 Aubu… 2383 2516 562 4899 541 2049 1103 2174 4223 0 5 5 420 451 871 434 #> 7 2006 LGA10250 Ball… 997 1015 197 2012 191 1370 388 1384 2754 7 0 7 256 273 529 246 #> 8 2006 LGA10300 Balr… 76 87 16 163 8 82 24 93 175 0 0 0 15 15 30 18 #> 9 2006 LGA10350 Bank… 6130 6578 1240 12708 1331 5957 2571 6468 12425 14 3 17 1202 1257 2459 1189 #> 10 2006 LGA10470 Bath… 1115 1240 227 2355 276 1272 503 1337 2609 6 0 6 243 254 497 234 #> # … with 675 more rows, 290 more variables: `Age (Years) 11_males` , `Age (Years) 11_persons` , `Age (Years) 12_females` , #> # `Age (Years) 12_males` , `Age (Years) 12_persons` , `Age (Years) 13_females` , `Age (Years) 13_males` , #> # `Age (Years) 13_persons` , `Age (Years) 14_females` , `Age (Years) 14_males` , `Age (Years) 14_persons` , #> # `Age (Years) 15-19 Years_females` , `Age (Years) 15-19 Years_males` , `Age (Years) 15-19 Years_persons` , #> # `Age (Years) 15_females` , `Age (Years) 15_males` , `Age (Years) 15_persons` , `Age (Years) 16_males` , #> # `Age (Years) 16_females` , `Age (Years) 16_persons` , `Age (Years) 17_females` , `Age (Years) 17_males` , #> # `Age (Years) 17_persons` , `Age (Years) 18_females` , `Age (Years) 18_males` , `Age (Years) 18_persons` , … ``` Note: Be mindful some tables are quite large! By default, this command will store each table in a parquet file in the cache by directory if you want to stop that from happening use the argument *cache=FALSE*. Similarly, its default behaviour is to read data from cached files before extracting them from the data pack. If you want to ignore the cache, use *ignore_cache=TRUE*. ## Getting ready to use summaries Although *get_census_data()*, provides great flexibility for data processing, in most cases, it is just of interest to get just a table with a limited number of variables for analysis. For this purpose, this package provides *get_census_summary()*. For instance, if we are only interested in getting the numbers for 60-year-olds by LGA in the 2021 census (see above), we can use: ```r get_census_summary(table_number = "04", attribute = list("60 year old male"=c("Age_years_60_males","Age (Years): 60_males"), "60 year old female"=c("Age_years_60_males","Age (Years): 60_females")), geo_structure = "LGA") #> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/LGA_Vic_B04_A.csv': resource busy or locked #> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/LGA_Vic_B04_A.csv': resource busy or locked #> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2011Census_B04A_AUST_LGA_short.csv': resource busy or locked #> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2016Census_G04A_AUS_LGA.csv': resource busy or locked #> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2021Census_G04A_AUST_LGA.csv': resource busy or locked #> # A tibble: 3,075 × 5 #> Census_Code Unit Year Attribute Value #> #> 1 LGA10050 Albury 2006 60 year old female 217 #> 2 LGA10050 Albury 2006 60 year old male 229 #> 3 LGA10110 Armidale Dumaresq 2006 60 year old female 122 #> 4 LGA10110 Armidale Dumaresq 2006 60 year old male 112 #> 5 LGA10150 Ashfield 2006 60 year old female 157 #> 6 LGA10150 Ashfield 2006 60 year old male 164 #> 7 LGA10200 Auburn 2006 60 year old female 210 #> 8 LGA10200 Auburn 2006 60 year old male 283 #> 9 LGA10250 Ballina 2006 60 year old female 217 #> 10 LGA10250 Ballina 2006 60 year old male 230 #> # … with 3,065 more rows ``` Finally, this function contains two more parameters of interest: *geo_units* filters the results by giving a vector of names; *reference_total* passes the function a vector of the names of another attribute which will be used as a "total" value to calculate the attributes as a percentage of said total value. For instance, if we want to narrow the above results to the LGAs of Melbourne, Stonnington and Yarra, and present the results as a percentage of the total population, we have: ```r get_census_summary(table_number = "04", attribute = list("60 year old male"=c("Age_years_60_males","Age (Years): 60_males"), "60 year old female"=c("Age_years_60_males","Age (Years): 60_females")), geo_unit_names = c("Melbourne","Stonnington","Yarra"), reference_total = list("Total"=c("Total_persons")), geo_structure = "LGA") #> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2021Census_G04A_AUST_LGA.csv': resource busy or locked #> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/LGA_Vic_B04_A.csv': resource busy or locked #> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2011Census_B04A_AUST_LGA_short.csv': resource busy or locked #> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2016Census_G04A_AUS_LGA.csv': resource busy or locked #> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2021Census_G04A_AUST_LGA.csv': resource busy or locked #> # A tibble: 15 × 7 #> Census_Code Unit Year Attribute Value Total Percentage #> #> 1 LGA24600 Melbourne 2006 60 year old female 234 71380 0.00328 #> 2 LGA24600 Melbourne 2006 60 year old male 306 71380 0.00429 #> 3 LGA26350 Stonnington 2006 60 year old female 498 89882 0.00554 #> 4 LGA26350 Stonnington 2006 60 year old male 473 89882 0.00526 #> 5 LGA27350 Yarra 2006 60 year old female 295 69330 0.00426 #> 6 LGA27350 Yarra 2006 60 year old male 266 69330 0.00384 #> 7 LGA24600 Melbourne 2011 60 year old male 313 93626 0.00334 #> 8 LGA26350 Stonnington 2011 60 year old male 431 93145 0.00463 #> 9 LGA27350 Yarra 2011 60 year old male 332 74090 0.00448 #> 10 LGA24600 Melbourne 2016 60 year old male 379 135959 0.00279 #> 11 LGA26350 Stonnington 2016 60 year old male 444 103832 0.00428 #> 12 LGA27350 Yarra 2016 60 year old male 375 86657 0.00433 #> 13 LGA24600 Melbourne 2021 60 year old male 492 149615 0.00329 #> 14 LGA26350 Stonnington 2021 60 year old male 528 104703 0.00504 #> 15 LGA27350 Yarra 2021 60 year old male 425 90114 0.00472 ``` ## Creating attributes list Sometimes lists of attributes can be very big, consolidating multiple statistics into one category (i.e. aggregate all places of birth by continent). To facilitate the process, *attribute_tibble_to_list()* can take the two first columns of a data frame and convert them to the desired format: ```r attributes <- tribble(~Census_stat, ~ Group, "Age_years_60_males","60 year old male", "Age (Years): 60_males","60 year old male", "Age_years_60_males","60 year old female", "Age (Years): 60_females","60 year old female") attribute_tibble_to_list(attributes) #> $`60 year old male` #> [1] "Age_years_60_males" "Age (Years): 60_males" #> #> $`60 year old female` #> [1] "Age_years_60_males" "Age (Years): 60_females" ```