I’m a Senior Research Scientist with Data61 in Hobart with a background in psychology / Human Factors. I’m interested in the convergence of human behaviour and cognition with emerging interactive technology. Combining these areas enables innovative Human-Machine Interaction (HMI) research. My research focuses on:
HMI research with new interactive & automation technologies and decision support systems
Design of technology applications for training and healthcare
New approaches for assessing and measuring human behaviour and cognitive processes (including psychometrics, cognitive load and trust dynamics)
Online job advertisements are now being used around the world as a source of insight into the nature of our rapidly changing labour markets. However, it is important to check the quality and potential bias of this data source in order to determine what inferences can appropriately be drawn from the data, and whether and how it can inform decisions about labour markets. The goal of the wider project this report is part of is to understand the strengths and weaknesses of online job advertisements as a source of information about changing demand for workers and skills. Here we present one example of the data validation and characterisation process: a comparison of two online job ad datasets, the Adzuna Australia job ads data (a dataset we are using to create an online skills dashboard) and the Internet Vacancy Index (IVI, a dataset collected by the Department of Jobs and Small Business). Both datasets comprise online job ads drawn from different sources (online job ad portals).
Figure 1 compares job ad counts from both datasets over time. The IVI data has a three-month moving average filter applied. The red line shows the original (unfiltered) Adzuna Australia data and the green line shows the same data with a three-month moving average filter (mav) applied. As expected, the latter corresponds better with the IVI data (also see Table 1). Therefore, we will only use the filtered Adzuna Australia data for further visualisation and analysis.
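# Packages used by the code in this post (assumed; the original setup chunk is not shown)
library(dplyr)
library(ggplot2)
library(gganimate)   # transition_reveal(), animate(), gifski_renderer()
library(kableExtra)  # kable_styling()

# Total count per month and source, collapsed over all occupations and GCCSAs
df_merged_long_all <- df_merged_long %>%
  group_by(Month, source) %>%
  # filter(source != "Adzuna_count") %>%
  summarise(count = sum(count))

p <- ggplot(df_merged_long_all, aes(x = Month, y = count, colour = source)) +
  geom_line() +
  geom_point(shape = 1) +
  labs(title = "Comparison of Adzuna Australia and IVI job ad counts between 2015 - 2019",
       x = "Date", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size = 7)) +
  # add animation
  transition_reveal(Month)

# stop the animation after the first loop
animate(p, renderer = gifski_renderer(loop = FALSE))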
Figure 1: Adzuna Australia and IVI job ad counts - collapsed over all occupations and GCCSAs
cor_collapsed <- round(cor(df_merged[, c(4,5,7)]),2)
knitr::kable(cor_collapsed, format = "html", caption = "Correlations Adzuna Australia and IVI job ads counts over time, all occupations and GCCSAs") %>%
kable_styling("striped", full_width = F)
| | Adzuna_count | Adzuna_mav_count | IVI_count |
|---|---|---|---|
| Adzuna_count | 1.00 | 0.99 | 0.92 |
| Adzuna_mav_count | 0.99 | 1.00 | 0.93 |
| IVI_count | 0.92 | 0.93 | 1.00 |
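The three-month moving average filter mentioned above can be sketched in base R as follows. This is only an illustration on made-up numbers: the vector `monthly_count` is hypothetical, and the choice of a trailing (rather than centred) window is an assumption, since the source does not say which variant the IVI data or the `Adzuna_mav_count` column uses.

```r
# Hypothetical monthly job ad counts (made-up numbers for illustration only)
monthly_count <- c(3, 6, 9, 12, 15)

# Trailing three-month moving average: each value is the mean of the current
# month and the two preceding months (assumption: trailing, not centred).
# stats::filter() is called with its namespace because dplyr masks filter().
mav_count <- as.numeric(stats::filter(monthly_count, rep(1 / 3, 3), sides = 1))

# The first two entries are NA because fewer than three months are available
print(mav_count)
```

Using `sides = 2` instead would give a centred window, which shifts the smoothed series by one month relative to the trailing version.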
To get another perspective on how the counts from the two datasets compare over time, we plot the data as a scatter plot and fit a LOESS curve (locally estimated scatterplot smoothing) to assess temporal trends. Figure 2 shows the data per GCCSA (Greater Capital City Statistical Area) for all States and Territories.
df_merged_long_GCCSA <- df_merged_long %>%
  filter(source != "Adzuna_count") %>%
  group_by(Month, GCCSA, source) %>%
  summarise(count = sum(count))

ggplot(df_merged_long_GCCSA, aes(x = Month, y = count, colour = source)) +
  geom_point(shape = 1, size = 1, alpha = 0.7) +
  geom_smooth(method = 'loess') +
  labs(title = "Comparison of Adzuna Australia and IVI job ads counts per GCCSA (all occupations)",
       x = "Date", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size = 7)) +
  facet_wrap(~GCCSA, scales = "free_y")
Figure 2: Adzuna Australia and IVI job ads counts - GCCSA
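For readers who want to see what the LOESS curve does outside of ggplot2, here is a minimal sketch using `stats::loess()` directly. The data are simulated (the `toy` data frame and its columns are hypothetical, not the Adzuna or IVI data), and the span of 0.75 is simply the default of `loess()`.

```r
# Simulated monthly counts with a trend, a slow wave and noise
# (made-up data for illustration only)
set.seed(1)
month_index <- 1:48
count <- 1000 + 5 * month_index + 100 * sin(month_index / 6) + rnorm(48, sd = 30)
toy <- data.frame(month_index = month_index, count = count)

# Fit a local regression; span controls how much of the data each local fit uses
fit <- loess(count ~ month_index, data = toy, span = 0.75)

# Smoothed trend evaluated at the observed months
toy$trend <- predict(fit)

head(toy)
```

A smaller `span` follows the points more closely, while a larger one gives a smoother, more global trend.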
Table 2 shows the correlation between Adzuna Australia and IVI job ads counts per GCCSA over time and across all occupations.
df_merged_GCCSA <- df_merged %>%
  group_by(Month, GCCSA) %>%
  summarise(Adzuna_mav_count = sum(Adzuna_mav_count),
            IVI_count = sum(IVI_count))
correlations_GCCSA <- data.frame(state=character(),
capital_city=numeric(),
rest_of_state=numeric(),
stringsAsFactors=FALSE)
correlations_GCCSA[1,1] <- "NSW"
correlations_GCCSA[1,2] <- with(df_merged_GCCSA %>% filter(GCCSA == "Greater Sydney"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_GCCSA[1,3] <- with(df_merged_GCCSA %>% filter(GCCSA == "Rest of NSW"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_GCCSA[2,1] <- "VIC"
correlations_GCCSA[2,2] <- with(df_merged_GCCSA %>% filter(GCCSA == "Greater Melbourne"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_GCCSA[2,3] <- with(df_merged_GCCSA %>% filter(GCCSA == "Rest of Vic."), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_GCCSA[3,1] <- "QLD"
correlations_GCCSA[3,2] <- with(df_merged_GCCSA %>% filter(GCCSA == "Greater Brisbane"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_GCCSA[3,3] <- with(df_merged_GCCSA %>% filter(GCCSA == "Rest of Qld"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_GCCSA[4,1] <- "SA"
correlations_GCCSA[4,2] <- with(df_merged_GCCSA %>% filter(GCCSA == "Greater Adelaide"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_GCCSA[4,3] <- with(df_merged_GCCSA %>% filter(GCCSA == "Rest of SA"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_GCCSA[5,1] <- "WA"
correlations_GCCSA[5,2] <- with(df_merged_GCCSA %>% filter(GCCSA == "Greater Perth"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_GCCSA[5,3] <- with(df_merged_GCCSA %>% filter(GCCSA == "Rest of WA"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_GCCSA[6,1] <- "TAS"
correlations_GCCSA[6,2] <- with(df_merged_GCCSA %>% filter(GCCSA == "Greater Hobart"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_GCCSA[6,3] <- with(df_merged_GCCSA %>% filter(GCCSA == "Rest of Tas."), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_GCCSA[7,1] <- "NT"
correlations_GCCSA[7,2] <- with(df_merged_GCCSA %>% filter(GCCSA == "Greater Darwin"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_GCCSA[7,3] <- with(df_merged_GCCSA %>% filter(GCCSA == "Rest of NT"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_GCCSA[8,1] <- "ACT"
correlations_GCCSA[8,2] <- with(df_merged_GCCSA %>% filter(GCCSA == "Australian Capital Territory"), round(cor(Adzuna_mav_count, IVI_count), 2))
knitr::kable(correlations_GCCSA, format = "html", caption = "Correlations Adzuna Australia and IVI job ads counts - GCCSAs") %>%
kable_styling("striped", full_width = F)
state | capital_city | rest_of_state |
---|---|---|
NSW | 0.19 | 0.51 |
VIC | 0.06 | 0.77 |
QLD | 0.17 | 0.20 |
SA | 0.27 | 0.37 |
WA | -0.14 | 0.78 |
TAS | 0.11 | 0.76 |
NT | 0.09 | 0.58 |
ACT | 0.39 | NA |
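The per-GCCSA correlations above are filled in one cell at a time. As a more compact alternative, the same kind of result can be sketched with a single grouped summary. The example below runs on a small made-up data frame (`toy` is hypothetical and only mimics the column names of `df_merged_GCCSA`), so the numbers it produces are not the ones in Table 2.

```r
library(dplyr)

# Toy stand-in for df_merged_GCCSA (made-up numbers, not the real data)
toy <- data.frame(
  Month = rep(1:4, times = 2),
  GCCSA = rep(c("Greater Hobart", "Rest of Tas."), each = 4),
  Adzuna_mav_count = c(10, 12, 14, 16, 5, 7, 6, 9),
  IVI_count = c(20, 24, 28, 32, 9, 8, 7, 6)
)

# One correlation per GCCSA, computed over months
correlations_by_gccsa <- toy %>%
  group_by(GCCSA) %>%
  summarise(correlation = round(cor(Adzuna_mav_count, IVI_count), 2))

print(correlations_by_gccsa)
```

This returns one row per GCCSA in long form. Splitting the result into capital city and rest of state columns, as in Table 2, would need an extra reshaping step that is not shown here.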
Figure 3 compares counts of Adzuna Australia and IVI job ads per major occupational group (using ANZSCO, the Australian and New Zealand Standard Classification of Occupations) over time and across all GCCSAs. Again, we plot the data as a scatter plot and fit a LOESS curve.
df_merged_long_occupation <- df_merged_long %>%
  filter(source != "Adzuna_count") %>%
  group_by(Month, ANZSCO_TITLE, source) %>%
  summarise(count = sum(count))

ggplot(df_merged_long_occupation, aes(x = Month, y = count, colour = source)) +
  geom_point(shape = 1, size = 1, alpha = 0.7) +
  geom_smooth(method = 'loess') +
  labs(title = "Comparison of Adzuna Australia and IVI job ads counts per occupation (all GCCSAs)",
       x = "Date", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size = 7)) +
  facet_wrap(~ANZSCO_TITLE, scales = "free_y")
Figure 3: IVI and Adzuna job ads count - Occupations
Table 3 shows the correlation between Adzuna Australia and IVI job ads counts per major ANZSCO occupational group over time and across all GCCSAs.
df_merged_occupation <- df_merged %>%
  group_by(Month, ANZSCO_TITLE) %>%
  summarise(Adzuna_mav_count = sum(Adzuna_mav_count),
            IVI_count = sum(IVI_count))
correlations_occupation <- data.frame("Occupation"=character(),
"Correlation"=numeric(),
stringsAsFactors=FALSE)
correlations_occupation[1,1] <- "Managers"
correlations_occupation[1,2] <- with(df_merged_occupation %>% filter(ANZSCO_TITLE == "Managers"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_occupation[2,1] <- "Professionals"
correlations_occupation[2,2] <- with(df_merged_occupation %>% filter(ANZSCO_TITLE == "Professionals"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_occupation[3,1] <- "Technicians and Trades Workers"
correlations_occupation[3,2] <- with(df_merged_occupation %>% filter(ANZSCO_TITLE == "Technicians and Trades Workers"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_occupation[4,1] <- "Community and Personal Service Workers"
correlations_occupation[4,2] <- with(df_merged_occupation %>% filter(ANZSCO_TITLE == "Community and Personal Service Workers"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_occupation[5,1] <- "Clerical and Administrative Workers"
correlations_occupation[5,2] <- with(df_merged_occupation %>% filter(ANZSCO_TITLE == "Clerical and Administrative Workers"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_occupation[6,1] <- "Sales Workers"
correlations_occupation[6,2] <- with(df_merged_occupation %>% filter(ANZSCO_TITLE == "Sales Workers"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_occupation[7,1] <- "Machinery Operators and Drivers"
correlations_occupation[7,2] <- with(df_merged_occupation %>% filter(ANZSCO_TITLE == "Machinery Operators and Drivers"), round(cor(Adzuna_mav_count, IVI_count), 2))
correlations_occupation[8,1] <- "Labourers"
correlations_occupation[8,2] <- with(df_merged_occupation %>% filter(ANZSCO_TITLE == "Labourers"), round(cor(Adzuna_mav_count, IVI_count), 2))
knitr::kable(correlations_occupation, format = "html", caption = "Correlations Adzuna Australia and IVI job ads counts - occupation") %>%
kable_styling("striped", full_width = F)
Occupation | Correlation |
---|---|
Managers | 0.33 |
Professionals | 0.39 |
Technicians and Trades Workers | -0.38 |
Community and Personal Service Workers | -0.49 |
Clerical and Administrative Workers | 0.27 |
Sales Workers | 0.51 |
Machinery Operators and Drivers | -0.01 |
Labourers | 0.03 |
Next we plot detailed correlations over time between the datasets for all GCCSAs and occupations (see Figure 4).
df_merged %>%
  group_by(GCCSA, ANZSCO_TITLE) %>%
  summarise(correlation = cor(Adzuna_mav_count, IVI_count)) %>%
  mutate(correlation = round(correlation, 2)) %>%
  ggplot(aes(x = ANZSCO_TITLE, y = correlation)) +
  geom_bar(stat = "identity", alpha = 0.5) +
  geom_text(aes(label = correlation), nudge_y = 0.2, colour = "blue", size = 1.8) +
  theme(axis.text.x = element_text(angle = 75, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size = 7)) +
  xlab("Occupation") +
  ylab("Correlation") +
  facet_wrap(~GCCSA)
Figure 4: Adzuna Australia and IVI job ads count detailed correlations
And finally, we plot overall counts over time in both datasets for each GCCSA and occupation (see Figure 5).
df_merged_long %>%
  filter(source != "Adzuna_count") %>%
  group_by(GCCSA, ANZSCO_TITLE, source) %>%
  summarise(count = sum(count)) %>%
  ggplot(aes(x = GCCSA, y = ANZSCO_TITLE, colour = source)) +
  geom_jitter(aes(size = count), alpha = 0.5, width = 0.1, height = 0.1) +
  # geom_point(aes(size = count), alpha = 0.3) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size = 7)) +
  ylab("Occupation")
Figure 5: Overall Adzuna Australia and IVI job ads counts - GCCSAs and Occupations
The main take-away points from the above data comparisons are:
Low correlation between datasets over time within a given geographic location. Correlations for rest of state are generally higher than for capital city areas.
Correlations for occupations vary widely. We see moderate positive correlations over time for sales workers, professionals, managers, and clerical and administrative workers (0.27 to 0.51); essentially no correlation for machinery operators and drivers and for labourers; and negative correlations for technicians and trades workers as well as community and personal service workers. This would suggest that, taking IVI as a reference, sales, professional and managerial roles are being captured better by Adzuna Australia than other occupations. However, the negative correlations in particular are surprising and need to be investigated further.
When drilling down into occupations by GCCSA over time, it becomes more difficult to interpret emerging patterns. For some of these combinations the data corresponds better than for others. However, such detailed data might still be valuable for some users and can help inform how confident we can be in such data sources.
…by mostly with trying to figure out nitty-gritty details, deciphering error messages, and working out why some of my code did not work. And continuous iterations of tidying, visualising, analysing, tidying…
My next challenge (apart from honing my R skills) will be getting into Python.
Although it was a considerable time investment, and often challenging to manage alongside project and other work, attending Data School gave me time to practise my (emerging) R skills.