using CSV
using DataFrames
using Statistics
using Chain
#note that the missingstring arg will read in any "NA" strings as missings
= CSV.read(download("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv"), DataFrame, missingstring="NA"
penguins );
DataFrames.jl Cheat Sheet
A WIP cheat sheet for working with DataFrames.jl
A cheat sheet for using DataFrames.jl. This is a work in progress and will (hopefully) be continuously updated/modified.
Load Pkgs and Read in data
Check Properties of data
size(penguins)
(344, 8)
see the first 3 rows
first(penguins, 3)
Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|
String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | |
1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
get the column names of a dataframe
names(penguins)
8-element Vector{String}:
"species"
"island"
"bill_length_mm"
"bill_depth_mm"
"flipper_length_mm"
"body_mass_g"
"sex"
"year"
nb we could also convert these to Symbol
via:
Symbol.(names(penguins))
8-element Vector{Symbol}:
:species
:island
:bill_length_mm
:bill_depth_mm
:flipper_length_mm
:body_mass_g
:sex
:year
If we want to get the column types, we can do the following:
eltype.(eachcol(penguins))
8-element Vector{Type}:
String15
String15
Union{Missing, Float64}
Union{Missing, Float64}
Union{Missing, Int64}
Union{Missing, Int64}
Union{Missing, String7}
Int64
Subsetting Rows
In general, we’ll want to use the subset()
or subset!()
functions to subset rows in a dataframe (subset!()
is the in-place version)
The syntax is a little wonky, and it tends to involve passing a symbol into an anonymous function that is then broadcast. For instance, the following will keep only the Chinstrap penguins:
subset(penguins, :species => s -> s .== "Chinstrap")
Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|
String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | |
1 | Chinstrap | Dream | 46.5 | 17.9 | 192 | 3500 | female | 2007 |
2 | Chinstrap | Dream | 50.0 | 19.5 | 196 | 3900 | male | 2007 |
3 | Chinstrap | Dream | 51.3 | 19.2 | 193 | 3650 | male | 2007 |
4 | Chinstrap | Dream | 45.4 | 18.7 | 188 | 3525 | female | 2007 |
5 | Chinstrap | Dream | 52.7 | 19.8 | 197 | 3725 | male | 2007 |
6 | Chinstrap | Dream | 45.2 | 17.8 | 198 | 3950 | female | 2007 |
7 | Chinstrap | Dream | 46.1 | 18.2 | 178 | 3250 | female | 2007 |
8 | Chinstrap | Dream | 51.3 | 18.2 | 197 | 3750 | male | 2007 |
9 | Chinstrap | Dream | 46.0 | 18.9 | 195 | 4150 | female | 2007 |
10 | Chinstrap | Dream | 51.3 | 19.9 | 198 | 3700 | male | 2007 |
11 | Chinstrap | Dream | 46.6 | 17.8 | 193 | 3800 | female | 2007 |
12 | Chinstrap | Dream | 51.7 | 20.3 | 194 | 3775 | male | 2007 |
13 | Chinstrap | Dream | 47.0 | 17.3 | 185 | 3700 | female | 2007 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
57 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 |
58 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 |
59 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 |
60 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 |
61 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 |
62 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 |
63 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 |
64 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 |
65 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 |
66 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 |
67 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 |
68 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
And it’s straightforward to pass multiple conditions to the function:
subset(penguins, :species => s -> s .== "Chinstrap", :bill_length_mm => b -> b .> 50, skipmissing=true)
Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|
String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | |
1 | Chinstrap | Dream | 51.3 | 19.2 | 193 | 3650 | male | 2007 |
2 | Chinstrap | Dream | 52.7 | 19.8 | 197 | 3725 | male | 2007 |
3 | Chinstrap | Dream | 51.3 | 18.2 | 197 | 3750 | male | 2007 |
4 | Chinstrap | Dream | 51.3 | 19.9 | 198 | 3700 | male | 2007 |
5 | Chinstrap | Dream | 51.7 | 20.3 | 194 | 3775 | male | 2007 |
6 | Chinstrap | Dream | 52.0 | 18.1 | 201 | 4050 | male | 2007 |
7 | Chinstrap | Dream | 50.5 | 19.6 | 201 | 4050 | male | 2007 |
8 | Chinstrap | Dream | 50.3 | 20.0 | 197 | 3300 | male | 2007 |
9 | Chinstrap | Dream | 58.0 | 17.8 | 181 | 3700 | female | 2007 |
10 | Chinstrap | Dream | 50.6 | 19.4 | 193 | 3800 | male | 2007 |
11 | Chinstrap | Dream | 52.0 | 19.0 | 197 | 4150 | male | 2007 |
12 | Chinstrap | Dream | 50.5 | 18.4 | 200 | 3400 | female | 2008 |
13 | Chinstrap | Dream | 52.8 | 20.0 | 205 | 4550 | male | 2008 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
19 | Chinstrap | Dream | 50.9 | 17.9 | 196 | 3675 | female | 2009 |
20 | Chinstrap | Dream | 50.8 | 18.5 | 201 | 4450 | male | 2009 |
21 | Chinstrap | Dream | 50.1 | 17.9 | 190 | 3400 | female | 2009 |
22 | Chinstrap | Dream | 51.5 | 18.7 | 187 | 3250 | male | 2009 |
23 | Chinstrap | Dream | 51.4 | 19.0 | 201 | 3950 | male | 2009 |
24 | Chinstrap | Dream | 50.7 | 19.7 | 203 | 4050 | male | 2009 |
25 | Chinstrap | Dream | 52.2 | 18.8 | 197 | 3450 | male | 2009 |
26 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 |
27 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 |
28 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 |
29 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 |
30 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
Note the use of the skipmissing
argument. If there are missing values in a column that’s being filtered on, this will throw an error
We can also drop rows with any missing data via dropmissing(df)
dropmissing(penguins)
Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|
String15 | String15 | Float64 | Float64 | Int64 | Int64 | String7 | Int64 | |
1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
6 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 |
7 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 |
8 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female | 2007 |
9 | Adelie | Torgersen | 38.6 | 21.2 | 191 | 3800 | male | 2007 |
10 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male | 2007 |
11 | Adelie | Torgersen | 36.6 | 17.8 | 185 | 3700 | female | 2007 |
12 | Adelie | Torgersen | 38.7 | 19.0 | 195 | 3450 | female | 2007 |
13 | Adelie | Torgersen | 42.5 | 20.7 | 197 | 4500 | male | 2007 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
322 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 |
323 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 |
324 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 |
325 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 |
326 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 |
327 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 |
328 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 |
329 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 |
330 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 |
331 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 |
332 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 |
333 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
Or we can choose to drop rows with missing values on a specific column:
dropmissing(penguins, :bill_length_mm)
Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|
String15 | String15 | Float64 | Float64? | Int64? | Int64? | String7? | Int64 | |
1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
6 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 |
7 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 |
8 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing | 2007 |
9 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing | 2007 |
10 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing | 2007 |
11 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing | 2007 |
12 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female | 2007 |
13 | Adelie | Torgersen | 38.6 | 21.2 | 191 | 3800 | male | 2007 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
331 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 |
332 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 |
333 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 |
334 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 |
335 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 |
336 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 |
337 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 |
338 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 |
339 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 |
340 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 |
341 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 |
342 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
We can also subset rows using regex:
subset(penguins, :species => ByRow(x -> occursin(r"^Chin", x)))
Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|
String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | |
1 | Chinstrap | Dream | 46.5 | 17.9 | 192 | 3500 | female | 2007 |
2 | Chinstrap | Dream | 50.0 | 19.5 | 196 | 3900 | male | 2007 |
3 | Chinstrap | Dream | 51.3 | 19.2 | 193 | 3650 | male | 2007 |
4 | Chinstrap | Dream | 45.4 | 18.7 | 188 | 3525 | female | 2007 |
5 | Chinstrap | Dream | 52.7 | 19.8 | 197 | 3725 | male | 2007 |
6 | Chinstrap | Dream | 45.2 | 17.8 | 198 | 3950 | female | 2007 |
7 | Chinstrap | Dream | 46.1 | 18.2 | 178 | 3250 | female | 2007 |
8 | Chinstrap | Dream | 51.3 | 18.2 | 197 | 3750 | male | 2007 |
9 | Chinstrap | Dream | 46.0 | 18.9 | 195 | 4150 | female | 2007 |
10 | Chinstrap | Dream | 51.3 | 19.9 | 198 | 3700 | male | 2007 |
11 | Chinstrap | Dream | 46.6 | 17.8 | 193 | 3800 | female | 2007 |
12 | Chinstrap | Dream | 51.7 | 20.3 | 194 | 3775 | male | 2007 |
13 | Chinstrap | Dream | 47.0 | 17.3 | 185 | 3700 | female | 2007 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
57 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 |
58 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 |
59 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 |
60 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 |
61 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 |
62 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 |
63 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 |
64 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 |
65 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 |
66 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 |
67 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 |
68 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
This uses the ByRow()
function to apply a function to each row as opposed to broadcasting, but the broadcasting notation also works
subset(penguins, :species => x -> occursin.(r"^Chin", x))
Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|
String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | |
1 | Chinstrap | Dream | 46.5 | 17.9 | 192 | 3500 | female | 2007 |
2 | Chinstrap | Dream | 50.0 | 19.5 | 196 | 3900 | male | 2007 |
3 | Chinstrap | Dream | 51.3 | 19.2 | 193 | 3650 | male | 2007 |
4 | Chinstrap | Dream | 45.4 | 18.7 | 188 | 3525 | female | 2007 |
5 | Chinstrap | Dream | 52.7 | 19.8 | 197 | 3725 | male | 2007 |
6 | Chinstrap | Dream | 45.2 | 17.8 | 198 | 3950 | female | 2007 |
7 | Chinstrap | Dream | 46.1 | 18.2 | 178 | 3250 | female | 2007 |
8 | Chinstrap | Dream | 51.3 | 18.2 | 197 | 3750 | male | 2007 |
9 | Chinstrap | Dream | 46.0 | 18.9 | 195 | 4150 | female | 2007 |
10 | Chinstrap | Dream | 51.3 | 19.9 | 198 | 3700 | male | 2007 |
11 | Chinstrap | Dream | 46.6 | 17.8 | 193 | 3800 | female | 2007 |
12 | Chinstrap | Dream | 51.7 | 20.3 | 194 | 3775 | male | 2007 |
13 | Chinstrap | Dream | 47.0 | 17.3 | 185 | 3700 | female | 2007 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
57 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 |
58 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 |
59 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 |
60 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 |
61 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 |
62 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 |
63 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 |
64 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 |
65 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 |
66 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 |
67 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 |
68 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
Finally, note that we can also use the bracket syntax with row indices to filter, but this feels like it’s not the most useful for rows:
#select rows 1, 10, 15
1, 10, 15], :] penguins[[
Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|
String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | |
1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
2 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing | 2007 |
3 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male | 2007 |
Subsetting Columns
Using bracket syntax to subset columns makes more sense:
:, 1:3] penguins[
Row | species | island | bill_length_mm |
---|---|---|---|
String15 | String15 | Float64? | |
1 | Adelie | Torgersen | 39.1 |
2 | Adelie | Torgersen | 39.5 |
3 | Adelie | Torgersen | 40.3 |
4 | Adelie | Torgersen | missing |
5 | Adelie | Torgersen | 36.7 |
6 | Adelie | Torgersen | 39.3 |
7 | Adelie | Torgersen | 38.9 |
8 | Adelie | Torgersen | 39.2 |
9 | Adelie | Torgersen | 34.1 |
10 | Adelie | Torgersen | 42.0 |
11 | Adelie | Torgersen | 37.8 |
12 | Adelie | Torgersen | 37.8 |
13 | Adelie | Torgersen | 41.1 |
⋮ | ⋮ | ⋮ | ⋮ |
333 | Chinstrap | Dream | 45.2 |
334 | Chinstrap | Dream | 49.3 |
335 | Chinstrap | Dream | 50.2 |
336 | Chinstrap | Dream | 45.6 |
337 | Chinstrap | Dream | 51.9 |
338 | Chinstrap | Dream | 46.8 |
339 | Chinstrap | Dream | 45.7 |
340 | Chinstrap | Dream | 55.8 |
341 | Chinstrap | Dream | 43.5 |
342 | Chinstrap | Dream | 49.6 |
343 | Chinstrap | Dream | 50.8 |
344 | Chinstrap | Dream | 50.2 |
We can also pass strings or symbols when using the bracket subsetting syntax:
:, [:species, :island]] penguins[
Row | species | island |
---|---|---|
String15 | String15 | |
1 | Adelie | Torgersen |
2 | Adelie | Torgersen |
3 | Adelie | Torgersen |
4 | Adelie | Torgersen |
5 | Adelie | Torgersen |
6 | Adelie | Torgersen |
7 | Adelie | Torgersen |
8 | Adelie | Torgersen |
9 | Adelie | Torgersen |
10 | Adelie | Torgersen |
11 | Adelie | Torgersen |
12 | Adelie | Torgersen |
13 | Adelie | Torgersen |
⋮ | ⋮ | ⋮ |
333 | Chinstrap | Dream |
334 | Chinstrap | Dream |
335 | Chinstrap | Dream |
336 | Chinstrap | Dream |
337 | Chinstrap | Dream |
338 | Chinstrap | Dream |
339 | Chinstrap | Dream |
340 | Chinstrap | Dream |
341 | Chinstrap | Dream |
342 | Chinstrap | Dream |
343 | Chinstrap | Dream |
344 | Chinstrap | Dream |
:, ["species", "island"]] penguins[
Row | species | island |
---|---|---|
String15 | String15 | |
1 | Adelie | Torgersen |
2 | Adelie | Torgersen |
3 | Adelie | Torgersen |
4 | Adelie | Torgersen |
5 | Adelie | Torgersen |
6 | Adelie | Torgersen |
7 | Adelie | Torgersen |
8 | Adelie | Torgersen |
9 | Adelie | Torgersen |
10 | Adelie | Torgersen |
11 | Adelie | Torgersen |
12 | Adelie | Torgersen |
13 | Adelie | Torgersen |
⋮ | ⋮ | ⋮ |
333 | Chinstrap | Dream |
334 | Chinstrap | Dream |
335 | Chinstrap | Dream |
336 | Chinstrap | Dream |
337 | Chinstrap | Dream |
338 | Chinstrap | Dream |
339 | Chinstrap | Dream |
340 | Chinstrap | Dream |
341 | Chinstrap | Dream |
342 | Chinstrap | Dream |
343 | Chinstrap | Dream |
344 | Chinstrap | Dream |
An alternative way to subset columns is to use the select()
function from DataFrames.jl
select(penguins, [:species, :island])
Row | species | island |
---|---|---|
String15 | String15 | |
1 | Adelie | Torgersen |
2 | Adelie | Torgersen |
3 | Adelie | Torgersen |
4 | Adelie | Torgersen |
5 | Adelie | Torgersen |
6 | Adelie | Torgersen |
7 | Adelie | Torgersen |
8 | Adelie | Torgersen |
9 | Adelie | Torgersen |
10 | Adelie | Torgersen |
11 | Adelie | Torgersen |
12 | Adelie | Torgersen |
13 | Adelie | Torgersen |
⋮ | ⋮ | ⋮ |
333 | Chinstrap | Dream |
334 | Chinstrap | Dream |
335 | Chinstrap | Dream |
336 | Chinstrap | Dream |
337 | Chinstrap | Dream |
338 | Chinstrap | Dream |
339 | Chinstrap | Dream |
340 | Chinstrap | Dream |
341 | Chinstrap | Dream |
342 | Chinstrap | Dream |
343 | Chinstrap | Dream |
344 | Chinstrap | Dream |
Just as with subsetting rows, we can also subset columns using regex. And the column subsetting is way easier than row subsetting:
select(penguins, r"bill")
Row | bill_length_mm | bill_depth_mm |
---|---|---|
Float64? | Float64? | |
1 | 39.1 | 18.7 |
2 | 39.5 | 17.4 |
3 | 40.3 | 18.0 |
4 | missing | missing |
5 | 36.7 | 19.3 |
6 | 39.3 | 20.6 |
7 | 38.9 | 17.8 |
8 | 39.2 | 19.6 |
9 | 34.1 | 18.1 |
10 | 42.0 | 20.2 |
11 | 37.8 | 17.1 |
12 | 37.8 | 17.3 |
13 | 41.1 | 17.6 |
⋮ | ⋮ | ⋮ |
333 | 45.2 | 16.6 |
334 | 49.3 | 19.9 |
335 | 50.2 | 18.8 |
336 | 45.6 | 19.4 |
337 | 51.9 | 19.5 |
338 | 46.8 | 16.5 |
339 | 45.7 | 17.0 |
340 | 55.8 | 19.8 |
341 | 43.5 | 18.1 |
342 | 49.6 | 18.2 |
343 | 50.8 | 19.0 |
344 | 50.2 | 18.7 |
Creating New Columns
We can create new columns using the transform()
function (or transform!()
for in-place).
The general pattern here is that creating a column takes 3 components:
- an input column,
- a function to apply to that column (usually wrapped in
ByRow()
), and - the name of the output column
transform(penguins, :species => ByRow(x -> x == "Chinstrap") => :is_chinstrap)
Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | is_chinstrap |
---|---|---|---|---|---|---|---|---|---|
String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | Bool | |
1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 | false |
2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 | false |
3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 | false |
4 | Adelie | Torgersen | missing | missing | missing | missing | missing | 2007 | false |
5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 | false |
6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 | false |
7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 | false |
8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 | false |
9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing | 2007 | false |
10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing | 2007 | false |
11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing | 2007 | false |
12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing | 2007 | false |
13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female | 2007 | false |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
333 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 | true |
334 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 | true |
335 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 | true |
336 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 | true |
337 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 | true |
338 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 | true |
339 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 | true |
340 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 | true |
341 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 | true |
342 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 | true |
343 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 | true |
344 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 | true |
We can also create multiple columns at the same time:
transform(
penguins,:species => ByRow(x -> x == "Chinstrap") => :is_chinstrap,
:body_mass_g => ByRow(x -> x > 4000) => :chonk
)
Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | is_chinstrap | chonk |
---|---|---|---|---|---|---|---|---|---|---|
String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | Bool | Bool? | |
1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 | false | false |
2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 | false | false |
3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 | false | false |
4 | Adelie | Torgersen | missing | missing | missing | missing | missing | 2007 | false | missing |
5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 | false | false |
6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 | false | false |
7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 | false | false |
8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 | false | true |
9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing | 2007 | false | false |
10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing | 2007 | false | true |
11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing | 2007 | false | false |
12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing | 2007 | false | false |
13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female | 2007 | false | false |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
333 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 | true | false |
334 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 | true | true |
335 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 | true | false |
336 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 | true | false |
337 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 | true | false |
338 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 | true | false |
339 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 | true | false |
340 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 | true | false |
341 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 | true | false |
342 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 | true | false |
343 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 | true | true |
344 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 | true | false |
We can pass multiple columns into an operation as well:
transform(
penguins,:bill_length_mm, :bill_depth_mm] => (+) => :bill_stuff
[ )
Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | bill_stuff |
---|---|---|---|---|---|---|---|---|---|
String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? | Int64 | Float64? | |
1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 | 57.8 |
2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 | 56.9 |
3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 | 58.3 |
4 | Adelie | Torgersen | missing | missing | missing | missing | missing | 2007 | missing |
5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 | 56.0 |
6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 | 59.9 |
7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 | 56.7 |
8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 | 58.8 |
9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing | 2007 | 52.2 |
10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing | 2007 | 62.2 |
11 | Adelie | Torgersen | 37.8 | 17.1 | 186 | 3300 | missing | 2007 | 54.9 |
12 | Adelie | Torgersen | 37.8 | 17.3 | 180 | 3700 | missing | 2007 | 55.1 |
13 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female | 2007 | 58.7 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
333 | Chinstrap | Dream | 45.2 | 16.6 | 191 | 3250 | female | 2009 | 61.8 |
334 | Chinstrap | Dream | 49.3 | 19.9 | 203 | 4050 | male | 2009 | 69.2 |
335 | Chinstrap | Dream | 50.2 | 18.8 | 202 | 3800 | male | 2009 | 69.0 |
336 | Chinstrap | Dream | 45.6 | 19.4 | 194 | 3525 | female | 2009 | 65.0 |
337 | Chinstrap | Dream | 51.9 | 19.5 | 206 | 3950 | male | 2009 | 71.4 |
338 | Chinstrap | Dream | 46.8 | 16.5 | 189 | 3650 | female | 2009 | 63.3 |
339 | Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 | 62.7 |
340 | Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 | 75.6 |
341 | Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 | 61.6 |
342 | Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 | 67.8 |
343 | Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 | 69.8 |
344 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 | 68.9 |
This syntax isn’t my favorite. Another way to do the same, though, if we wanted:
:bill_stuff = penguins.:bill_length_mm .+ penguins.:bill_depth_mm penguins.
344-element Vector{Union{Missing, Float64}}:
57.8
56.9
58.3
missing
56.0
59.9
56.7
58.800000000000004
52.2
62.2
54.9
55.099999999999994
58.7
⋮
61.800000000000004
69.19999999999999
69.0
65.0
71.4
63.3
62.7
75.6
61.6
67.8
69.8
68.9
Summarizing DataFrames
Often, summarizing will involve combining the groupby()
method with the combine()
method:
# we need to drop missings here on the body mass column first
= groupby(dropmissing(penguins, :body_mass_g), :species)
grouped_penguins
combine(grouped_penguins, :body_mass_g => mean)
Row | species | body_mass_g_mean |
---|---|---|
String15 | Float64 | |
1 | Adelie | 3700.66 |
2 | Gentoo | 5076.02 |
3 | Chinstrap | 3733.09 |
Counting is a common operation:
combine(grouped_penguins, nrow)
Row | species | nrow |
---|---|---|
String15 | Int64 | |
1 | Adelie | 151 |
2 | Gentoo | 123 |
3 | Chinstrap | 68 |
The problem here is that we’ve already dropped missing values on body mass, so we’re not counting all of the observations here. I’m not sure if there’s a great way around this inside of combine()
, but something to explore more…
Anyway, say we want to take the mean of all of our numeric columns. We can do that as follows. Note that I don’t really like this syntax, especially compared to some of the way you’d do this in tidyverse
or even in Python’s polars
. It also has the issue that I’m having to keep complete observations rather than column-wise complete data.
= names(penguins, Union{Missing,Number})
numeric_cols
= dropmissing(penguins)
nomiss_penguins
= groupby(nomiss_penguins, :species)
grouped_penguins
= combine(grouped_penguins, numeric_cols .=> mean) species_means
Row | species | bill_length_mm_mean | bill_depth_mm_mean | flipper_length_mm_mean | body_mass_g_mean | year_mean | bill_stuff_mean |
---|---|---|---|---|---|---|---|
String15 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | Adelie | 38.824 | 18.3473 | 190.103 | 3706.16 | 2008.05 | 57.1712 |
2 | Gentoo | 47.5681 | 14.9966 | 217.235 | 5092.44 | 2008.07 | 62.5647 |
3 | Chinstrap | 48.8338 | 18.4206 | 195.824 | 3733.09 | 2007.97 | 67.2544 |
We can also perform multiple aggregations on a grouped dataframe, just like we did when we applied multiple transformations via transform()
:
combine(
grouped_penguins,:bill_length_mm => mean,
=> :n
nrow )
Row | species | bill_length_mm_mean | n |
---|---|---|---|
String15 | Float64 | Int64 | |
1 | Adelie | 38.824 | 146 |
2 | Gentoo | 47.5681 | 119 |
3 | Chinstrap | 48.8338 | 68 |
Reshaping data
Let’s go back to our species_means
data from before. Say we want to pivot this longer, so that we have a 3-column dataframe where our columns are species, metric, and value. We can do this with the stack()
function:
= stack(species_means, Not(:species), variable_name=:metric, value_name=:value) species_means_long
Row | species | metric | value |
---|---|---|---|
String15 | String | Float64 | |
1 | Adelie | bill_length_mm_mean | 38.824 |
2 | Gentoo | bill_length_mm_mean | 47.5681 |
3 | Chinstrap | bill_length_mm_mean | 48.8338 |
4 | Adelie | bill_depth_mm_mean | 18.3473 |
5 | Gentoo | bill_depth_mm_mean | 14.9966 |
6 | Chinstrap | bill_depth_mm_mean | 18.4206 |
7 | Adelie | flipper_length_mm_mean | 190.103 |
8 | Gentoo | flipper_length_mm_mean | 217.235 |
9 | Chinstrap | flipper_length_mm_mean | 195.824 |
10 | Adelie | body_mass_g_mean | 3706.16 |
11 | Gentoo | body_mass_g_mean | 5092.44 |
12 | Chinstrap | body_mass_g_mean | 3733.09 |
13 | Adelie | year_mean | 2008.05 |
14 | Gentoo | year_mean | 2008.07 |
15 | Chinstrap | year_mean | 2007.97 |
16 | Adelie | bill_stuff_mean | 57.1712 |
17 | Gentoo | bill_stuff_mean | 62.5647 |
18 | Chinstrap | bill_stuff_mean | 67.2544 |
To reshape back to wide, we can use the (appropriately named!) unstack()
function. This function has the following signature:
unstack(df, id_col, variable_names, variable_values)
So in the example below, species is the id variable, metric is the column with variable names, and value is the column with values.
unstack(species_means_long, :species, :metric, :value)
Row | species | bill_length_mm_mean | bill_depth_mm_mean | flipper_length_mm_mean | body_mass_g_mean | year_mean | bill_stuff_mean |
---|---|---|---|---|---|---|---|
String15 | Float64? | Float64? | Float64? | Float64? | Float64? | Float64? | |
1 | Adelie | 38.824 | 18.3473 | 190.103 | 3706.16 | 2008.05 | 57.1712 |
2 | Gentoo | 47.5681 | 14.9966 | 217.235 | 5092.44 | 2008.07 | 62.5647 |
3 | Chinstrap | 48.8338 | 18.4206 | 195.824 | 3733.09 | 2007.97 | 67.2544 |
Chaining Operations
You can chain dataframe operations together in Julia using the Chain.jl
package. This is basically the same as using the pipe in tidyverse R workflows to set up data processing pipelines.
For instance, we can do the following to summarize our data:
@chain penguins begin
dropmissinggroupby(:species)
combine(:body_mass_g => mean)
end
Row | species | body_mass_g_mean |
---|---|---|
String15 | Float64 | |
1 | Adelie | 3706.16 |
2 | Gentoo | 5092.44 |
3 | Chinstrap | 3733.09 |
The “default” specification of a chain (like above) is to assume the output of the previous function call is the first argument to the subsequent function call. So the output of dropmissing
– a dataframe – is the first argument to groupby()
, etc.
Another way to specify a chain is to use an underscore _
as a placeholder for the output of the previous expression. This isn’t super useful in cases where a dataframe will be the first argument, like below:
@chain penguins begin
dropmissing(_)
groupby(_, :species)
combine(_, :body_mass_g => mean)
end
Row | species | body_mass_g_mean |
---|---|---|
String15 | Float64 | |
1 | Adelie | 3706.16 |
2 | Gentoo | 5092.44 |
3 | Chinstrap | 3733.09 |
A not-very-useful but nevertheless illustrative version of using an underscore to pipe a value into the not-first argument of a function is below:
@chain :body_mass_g begin
combine(dropmissing(penguins), _ => mean)
end
Row | body_mass_g_mean |
---|---|
Float64 | |
1 | 4207.06 |