---
title: "Auscensus - Getting Data"
resource_files:
  - vignetttes/house-preference-flow*.png
author: "Carlos YANEZ SANTIBANEZ"
date: "2023-01-14"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Auscensus - Getting Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


Once the data has been imported, the package is ready to pull data from the data packs.

## Figuring out what to retrieve

The first thing to figure out what data points are going to be retrieved. In order to answer that, the following questions need to be answered:

- Which statistics do I want to pull, i.e. which data point from which table?
- Which geographical structure do I want  (e.g. SAs, LGAs, electorates)? Do i want data for all of them or just a handful?
- For which years do I want the data?

To answer those questions, {auscensus} comes with a family of functions, all starting with *list_*.

To figure out where is the data point to retrieve, the first step to retrieve a list of available tables, which will return their code plus the year when they have been published. this can be achieved with *list_census_tables()* 


```r
list_census_tables()
#> # A tibble: 165 × 7
#>    Number `Table Name`                                                            `Table Population`                   `2021` `2016` `2011` `2006`
#>    <chr>  <chr>                                                                   <chr>                                <lgl>  <lgl>  <lgl>  <lgl> 
#>  1 01     Selected Person Characteristics by Sex                                  Persons                              TRUE   TRUE   TRUE   NA    
#>  2 01     Selected Person Characteristics by Sex (First Release Processing)       Persons                              NA     NA     NA     TRUE  
#>  3 02     Selected Medians and Averages                                           NA                                   TRUE   TRUE   TRUE   TRUE  
#>  4 03     Place of Usual Residence by Place of Enumeration on Census Night by Age Persons (excludes overseas visitors) TRUE   NA     NA     NA    
#>  5 03     Place of Usual Residence on Census Night by Age                         Persons                              NA     TRUE   TRUE   TRUE  
#>  6 04     Age by Sex                                                              Persons                              TRUE   TRUE   TRUE   TRUE  
#>  7 05     Registered Marital Status by Age by Sex                                 Persons aged 15 years and over       TRUE   TRUE   TRUE   TRUE  
#>  8 06     Social Marital Status by Age by Sex                                     Persons aged 15 years and over       TRUE   TRUE   TRUE   TRUE  
#>  9 07     Indigenous Status by Age by Sex                                         Persons                              TRUE   TRUE   TRUE   TRUE  
#> 10 08     Ancestry by Country of Birth of Parents                                 Responses and persons                TRUE   TRUE   NA     TRUE  
#> # … with 155 more rows
```

This function comes with two in-built filters. The first one allows filtering by table number, whereas the second uses a regular expression to filter the table names


```r
list_census_tables("20")
#> # A tibble: 4 × 7
#>   Number `Table Name`                                                  `Table Population`             `2021` `2016` `2011` `2006`
#>   <chr>  <chr>                                                         <chr>                          <lgl>  <lgl>  <lgl>  <lgl> 
#> 1 20     Count of Selected Long-Term Health Conditions by Age by Sex   Persons                        TRUE   NA     NA     NA    
#> 2 20     Unpaid Domestic Work:  Number of House by Age by Sex          Persons aged 15 years and over NA     TRUE   NA     NA    
#> 3 20     Unpaid Domestic Work:  Number of Hours by Age by Sex          Persons aged 15 years and over NA     NA     TRUE   NA    
#> 4 20     Unpaid Assistance to a Person with a Disability by Age by Sex Persons aged 15 years and over NA     NA     NA     TRUE
```


```r
list_census_tables(table_name_regex = "[Cc]ountry")
#> # A tibble: 6 × 7
#>   Number `Table Name`                                                                    `Table Population`    `2021` `2016` `2006` `2011`
#>   <chr>  <chr>                                                                           <chr>                 <lgl>  <lgl>  <lgl>  <lgl> 
#> 1 08     Ancestry by Country of Birth of Parents                                         Responses and persons TRUE   TRUE   TRUE   NA    
#> 2 09     Country of Birth of Person by Age by Sex                                        Persons               TRUE   TRUE   NA     NA    
#> 3 09     Country of Birth of Person by Sex                                               Persons               NA     NA     TRUE   TRUE  
#> 4 10     Country of Birth of Person by Year of Arrival in Australia                      Persons born overseas TRUE   TRUE   TRUE   TRUE  
#> 5 26     Family Composition and Country of Birth of Parents by Age of Dependent Children Dependent children    NA     TRUE   NA     NA    
#> 6 30     Family Composition and Country of Birth of Parents by Age of Dependent Children Dependent children    TRUE   NA     NA     NA
```

Please note that even though the table names have changed from year to year, all tables with the same number contain the same data.

Once the table has been selected, *list_census_attributes()* can be used to select a data point. Since tables can contain many attributes, thus this function also allows filtering the results by a regular expression. For instance, if we pick table *04* (Age by Sex) and want to find the data points for 60-year-olds, plus the total, we can use the below command.


```r
list_census_attributes(number="04",attribute_regex = "60|[Tt]otal")
#> # A tibble: 15 × 6
#>    Table Attribute                        `2021` `2016` `2011` `2006`
#>    <chr> <chr>                            <lgl>  <lgl>  <lgl>  <lgl> 
#>  1 04    Age_years_60_males               TRUE   TRUE   TRUE   NA    
#>  2 04    Age_years_60_females             TRUE   TRUE   TRUE   NA    
#>  3 04    Age_years_60_persons             TRUE   TRUE   TRUE   NA    
#>  4 04    Age_years_60_64_years_males      TRUE   TRUE   TRUE   NA    
#>  5 04    Age_years_60_64_years_females    TRUE   TRUE   TRUE   NA    
#>  6 04    Age_years_60_64_years_persons    TRUE   TRUE   TRUE   NA    
#>  7 04    Total_males                      TRUE   TRUE   TRUE   TRUE  
#>  8 04    Total_females                    TRUE   TRUE   TRUE   TRUE  
#>  9 04    Total_persons                    TRUE   TRUE   TRUE   TRUE  
#> 10 04    Age (Years): 60_males            NA     NA     NA     TRUE  
#> 11 04    Age (Years): 60_females          NA     NA     NA     TRUE  
#> 12 04    Age (Years): 60_persons          NA     NA     NA     TRUE  
#> 13 04    Age (Years): 60-64 Years_males   NA     NA     NA     TRUE  
#> 14 04    Age (Years): 60-64 Years_females NA     NA     NA     TRUE  
#> 15 04    Age (Years): 60-64 Years_persons NA     NA     NA     TRUE
```

To get which geographical structures are available, *list_census_geo_types()* can help. Once selected an appropriate value, *list_census_geo()* can be used to search for particular divisions. In the example below, we are confirming if data is available for three inner-Melbourne Local Government areas.


```r
list_census_geo_types()
#> # A tibble: 31 × 5
#>    ASGS_Structure `2021` `2016` `2011` `2006`
#>    <chr>          <lgl>  <lgl>  <lgl>  <lgl> 
#>  1 AUS            TRUE   TRUE   TRUE   TRUE  
#>  2 CD             NA     NA     NA     TRUE  
#>  3 CED            TRUE   TRUE   TRUE   TRUE  
#>  4 GCCSA          TRUE   TRUE   TRUE   NA    
#>  5 IARE           TRUE   TRUE   TRUE   TRUE  
#>  6 ILOC           TRUE   TRUE   TRUE   TRUE  
#>  7 IREG           TRUE   TRUE   TRUE   TRUE  
#>  8 JWSLA          NA     NA     NA     TRUE  
#>  9 JWSLA_extended NA     NA     NA     TRUE  
#> 10 LGA            TRUE   TRUE   TRUE   TRUE  
#> # … with 21 more rows
```


```r
list_census_geo("LGA",geo_name_regex = "[Mm]elbourne|[Ss]tonnington|Yarra")
#> # A tibble: 5 × 6
#>   ASGS_Structure Census_Name  `2021` `2016` `2011` `2006`
#>   <chr>          <chr>        <lgl>  <lgl>  <lgl>  <lgl> 
#> 1 LGA            Melbourne    TRUE   TRUE   TRUE   TRUE  
#> 2 LGA            Stonnington  TRUE   TRUE   TRUE   TRUE  
#> 3 LGA            Yarra        TRUE   TRUE   TRUE   TRUE  
#> 4 LGA            Yarra Ranges TRUE   TRUE   TRUE   TRUE  
#> 5 LGA            Yarrabah     TRUE   TRUE   TRUE   TRUE
```

Please note that the above functions only show data for the *available* data packs - if a file has not been imported, its associated year won't be presented in the results.

## Retrieving Census Data

Once the questions have been resolved, it is possible to use {auscensus} to retrieve the data with *get_census_data()*. This function will present the result for each {table, year, geography} combination as a different element within a list. By default the function does not return data but a parquet binding function - this can be changes by setting *collect_data = TRUE*. For example:


```r

data <- get_census_data(census_table = list_census_tables("04"),
                        geo_structure = "LGA",
                        collect_data = TRUE)
#> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/LGA_Vic_B04_A.csv': resource busy or locked 
#> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2011Census_B04A_AUST_LGA_short.csv': resource busy or locked 
#> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2016Census_G04A_AUS_LGA.csv': resource busy or locked 
#> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2021Census_G04A_AUST_LGA.csv': resource busy or locked
  
names(data)   
#> [1] "2006_LGA_B04" "2011_LGA_B04" "2016_LGA_G04" "2021_LGA_G04"
```


```r
data$`2006_LGA_B04`
#> # A tibble: 685 × 309
#>     Year Census_Code Unit  Age (…¹ Age (…² Age (…³ Age (…⁴ Age (…⁵ Age (…⁶ Age (…⁷ Age (…⁸ Age (…⁹ Age (…˟ Age (…˟ Age (…˟ Age (…˟ Age (…˟ Age (…˟ Age (…˟
#>    <dbl> <chr>       <chr>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>
#>  1  2006 LGA89399    Unin…    9798   10282    2121   20080    2214   10436    4335   10924   21360      16       7      23    2021    2068    4089    2094
#>  2  2006 LGA89499    No U…      21      26       4      47       6      18      10       8      26       0       0       0       3       0       3       3
#>  3  2006 LGA10050    Albu…    1490    1524     305    3014     337    1675     642    1747    3422       8       4      12     305     326     631     357
#>  4  2006 LGA10110    Armi…     660     704     144    1364     137     910     281     899    1809       0       0       0     163     157     320     173
#>  5  2006 LGA10150    Ashf…    1048    1162     246    2210     265     943     511     967    1910      14       0      14     199     201     400     171
#>  6  2006 LGA10200    Aubu…    2383    2516     562    4899     541    2049    1103    2174    4223       0       5       5     420     451     871     434
#>  7  2006 LGA10250    Ball…     997    1015     197    2012     191    1370     388    1384    2754       7       0       7     256     273     529     246
#>  8  2006 LGA10300    Balr…      76      87      16     163       8      82      24      93     175       0       0       0      15      15      30      18
#>  9  2006 LGA10350    Bank…    6130    6578    1240   12708    1331    5957    2571    6468   12425      14       3      17    1202    1257    2459    1189
#> 10  2006 LGA10470    Bath…    1115    1240     227    2355     276    1272     503    1337    2609       6       0       6     243     254     497     234
#> # … with 675 more rows, 290 more variables: `Age (Years) 11_males` <int>, `Age (Years) 11_persons` <int>, `Age (Years) 12_females` <int>,
#> #   `Age (Years) 12_males` <int>, `Age (Years) 12_persons` <int>, `Age (Years) 13_females` <int>, `Age (Years) 13_males` <int>,
#> #   `Age (Years) 13_persons` <int>, `Age (Years) 14_females` <int>, `Age (Years) 14_males` <int>, `Age (Years) 14_persons` <int>,
#> #   `Age (Years) 15-19 Years_females` <int>, `Age (Years) 15-19 Years_males` <int>, `Age (Years) 15-19 Years_persons` <int>,
#> #   `Age (Years) 15_females` <int>, `Age (Years) 15_males` <int>, `Age (Years) 15_persons` <int>, `Age (Years) 16_males` <int>,
#> #   `Age (Years) 16_females` <int>, `Age (Years) 16_persons` <int>, `Age (Years) 17_females` <int>, `Age (Years) 17_males` <int>,
#> #   `Age (Years) 17_persons` <int>, `Age (Years) 18_females` <int>, `Age (Years) 18_males` <int>, `Age (Years) 18_persons` <int>, …
```

Note: Be mindful some tables are quite large!

By default, this command will store each table in a parquet file in the cache by directory if you want to stop that from happening use the argument *cache=FALSE*. Similarly, its default behaviour is to read data from cached files before extracting them from the data pack. If you want to ignore the cache, use *ignore_cache=TRUE*.

## Getting ready to use summaries

Although *get_census_data()*, provides great flexibility for data processing, in most cases, it is just of interest to get just a table with a limited number of variables for analysis. For this purpose, this package provides *get_census_summary()*. For instance, if we are only interested in getting the numbers for 60-year-olds  by LGA in the 2021 census (see above), we can use:


```r

get_census_summary(table_number = "04",
                   attribute = list("60 year old male"=c("Age_years_60_males","Age (Years): 60_males"),
                                     "60 year old female"=c("Age_years_60_males","Age (Years): 60_females")),
                   geo_structure = "LGA")
#> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/LGA_Vic_B04_A.csv': resource busy or locked 
#> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/LGA_Vic_B04_A.csv': resource busy or locked 
#> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2011Census_B04A_AUST_LGA_short.csv': resource busy or locked 
#> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2016Census_G04A_AUS_LGA.csv': resource busy or locked 
#> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2021Census_G04A_AUST_LGA.csv': resource busy or locked
#> # A tibble: 3,075 × 5
#>    Census_Code Unit               Year Attribute          Value
#>    <chr>       <chr>             <dbl> <chr>              <int>
#>  1 LGA10050    Albury             2006 60 year old female   217
#>  2 LGA10050    Albury             2006 60 year old male     229
#>  3 LGA10110    Armidale Dumaresq  2006 60 year old female   122
#>  4 LGA10110    Armidale Dumaresq  2006 60 year old male     112
#>  5 LGA10150    Ashfield           2006 60 year old female   157
#>  6 LGA10150    Ashfield           2006 60 year old male     164
#>  7 LGA10200    Auburn             2006 60 year old female   210
#>  8 LGA10200    Auburn             2006 60 year old male     283
#>  9 LGA10250    Ballina            2006 60 year old female   217
#> 10 LGA10250    Ballina            2006 60 year old male     230
#> # … with 3,065 more rows
```

Finally, this function contains two more parameters of interest: *geo_units* filters the results by giving a vector of names; *reference_total* passes the function a vector of the names of another attribute which will be used as a "total" value to calculate the attributes as a percentage of said total value. For instance, if we want to narrow the above results to the LGAs of Melbourne, Stonnington and Yarra, and present the results as a percentage of the total population, we have:


```r

get_census_summary(table_number = "04",
                   attribute = list("60 year old male"=c("Age_years_60_males","Age (Years): 60_males"),
                                     "60 year old female"=c("Age_years_60_males","Age (Years): 60_females")),
                   geo_unit_names  = c("Melbourne","Stonnington","Yarra"),
                   reference_total = list("Total"=c("Total_persons")),
                   geo_structure = "LGA")
#> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2021Census_G04A_AUST_LGA.csv': resource busy or locked 
#> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/LGA_Vic_B04_A.csv': resource busy or locked 
#> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2011Census_B04A_AUST_LGA_short.csv': resource busy or locked 
#> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2016Census_G04A_AUS_LGA.csv': resource busy or locked 
#> ERROR : [EBUSY] Failed to remove 'C:/Users/carlo/OneDrive/Documents/.auscensus_cache/2021Census_G04A_AUST_LGA.csv': resource busy or locked
#> # A tibble: 15 × 7
#>    Census_Code Unit         Year Attribute          Value  Total Percentage
#>    <chr>       <chr>       <dbl> <chr>              <int>  <int>      <dbl>
#>  1 LGA24600    Melbourne    2006 60 year old female   234  71380    0.00328
#>  2 LGA24600    Melbourne    2006 60 year old male     306  71380    0.00429
#>  3 LGA26350    Stonnington  2006 60 year old female   498  89882    0.00554
#>  4 LGA26350    Stonnington  2006 60 year old male     473  89882    0.00526
#>  5 LGA27350    Yarra        2006 60 year old female   295  69330    0.00426
#>  6 LGA27350    Yarra        2006 60 year old male     266  69330    0.00384
#>  7 LGA24600    Melbourne    2011 60 year old male     313  93626    0.00334
#>  8 LGA26350    Stonnington  2011 60 year old male     431  93145    0.00463
#>  9 LGA27350    Yarra        2011 60 year old male     332  74090    0.00448
#> 10 LGA24600    Melbourne    2016 60 year old male     379 135959    0.00279
#> 11 LGA26350    Stonnington  2016 60 year old male     444 103832    0.00428
#> 12 LGA27350    Yarra        2016 60 year old male     375  86657    0.00433
#> 13 LGA24600    Melbourne    2021 60 year old male     492 149615    0.00329
#> 14 LGA26350    Stonnington  2021 60 year old male     528 104703    0.00504
#> 15 LGA27350    Yarra        2021 60 year old male     425  90114    0.00472
```

## Creating attributes list

Sometimes lists of attributes can be very big, consolidating multiple statistics into one category (i.e. aggregate all places of birth by continent).
To facilitate the process, *attribute_tibble_to_list()* can take the two first columns of a data frame and convert them to the desired format:


```r
attributes <- tribble(~Census_stat, ~ Group,
                     "Age_years_60_males","60 year old male",
                     "Age (Years): 60_males","60 year old male",
                     "Age_years_60_males","60 year old female",
                    "Age (Years): 60_females","60 year old female")

attribute_tibble_to_list(attributes)
#> $`60 year old male`
#> [1] "Age_years_60_males"    "Age (Years): 60_males"
#> 
#> $`60 year old female`
#> [1] "Age_years_60_males"      "Age (Years): 60_females"
```