Videos related to this segment (click the title to watch)

02 JULIA2 - 1A: Introduction and data import (18:28)

02 JULIA2 - 1B: Getting insights of the data (25:26)

02 JULIA2 - 1C: Edit data and dataframe structure (23:40)

02 JULIA2 - 1D: Pivot, Split-Apply-Combine and data export (23:39)

0201 - Data Wrangling

Some stuff to set-up the environment..

julia> cd(@__DIR__)
julia> using Pkg
julia> Pkg.activate(".")
       # If using a Julia version different than 1.10 please uncomment and run the following line (reproductibility guarantee will hower be lost)
       #Pkg.resolve()  Activating project at `~/work/SPMLJ/SPMLJ/buildedDoc/02_-_JULIA2_-_Scientific_programming_with_Julia`
julia> Pkg.instantiate()
julia> using Random
julia> Random.seed!(123)Random.TaskLocalRNG()

Introduction

This segment will be mostly based on DataFrames and related packages

!!! info DataFrames vs Matrix DataFrames are popular format for in-memory tabular data. Their main advantages over Arrays are that they can efficiently store different types of data on each column (indeed each column is a wrapper over an Array{T,1} where T is specific to each column) and, thanks also to their named columns, provide convenient API for data operations, like indexing, querying , joining, split-apply-combine, etc.

Info

In most circunstances we can refer to dataframe columns either by using their name as a string, e.g. "Region", or as symbol, e.g. :Region. In the rest of the segment we will use the string approach.

Data import

Our example: Forest volumes and area by country and year Source: Eurostat; units: forarea: Milion hectars, forvol: Milion cubic metres

Built-in solution: CSV –> Matrix

julia> using DelimitedFiles  # in the stdlib
julia> data = convert(Array{Float64,2},readdlm("data.csv",';')[2:end,3:end]) # Skip the first 1 row and the first 2 columns - out is a Matrix8×2 Matrix{Float64}:
 11.354    3381.0
 15.288    2254.28
  8.36925  1058.71
 28.163    3184.67
 11.419    3663.0
 17.253    3055.83
  9.56613  1424.4
 27.98     3653.91

CSV.jl: {CSV file, hardwritten data} –> DataFrame

julia> using CSV, DataFrames
julia> data = CSV.read("data.csv",DataFrame) # source, sink, kword options8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String7  Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
julia> data = CSV.read("data.csv",NamedTuple)(Country = InlineStrings.String7["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"], Year = [2000, 2000, 2000, 2000, 2020, 2020, 2020, 2020], forarea = [11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98], forvol = [3381.0, 2254.28, 1058.71, 3184.67, 3663.0, 3055.83, 1424.4, 3653.91])
julia> data = CSV.read(IOBuffer("""
       Country	Year	forarea	forvol
       Germany	2000	11.354	3381
       France	2000	15.288	2254.28
       Italy	2000	8.36925	1058.71
       Sweden	2000	28.163	3184.67
       Germany	2020	11.419	3663
       France	2020	17.253	3055.83
       Italy	2020	9.56613	1424.4
       Sweden	2020	27.98	3653.91
       """), DataFrame, copycols=true)8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String7  Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91

Some common CSV.jl options: delim (use '\t' for tab delimited files), quotechar, openquotechar, closequotechar, escapechar, missingstring, dateformat, append, writeheader, header, newline, quotestrings, decimal, header, normalizenames, datarow, skipto, footerskip, limit, transpose, comment, use_mmap, type, types (e.g. types=Dict("fieldFoo" => Union{Missing,Int64})), typemap, pool, categorical, strict, silencewarnings, ignorerepeated

XLSX.jl: xlsx -> {Matrix, DataFrame}

julia> using XLSX
julia> sheetNames = XLSX.sheetnames(XLSX.readxlsx("data.xlsx"))1-element Vector{String}:
 "Sheet1"
julia> data = XLSX.readxlsx("data.xlsx")["Sheet1"]["A1:D9"]9×4 Matrix{Any}:
 "Country"      "Year"    "forarea"      "forvol"
 "Germany"  2000        11.354       3381
 "France"   2000        15.288       2254.28
 "Italy"    2000         8.36925     1058.71
 "Sweden"   2000        28.163       3184.67
 "Germany"  2020        11.419       3663
 "France"   2020        17.253       3055.83
 "Italy"    2020         9.56613     1424.4
 "Sweden"   2020        27.98        3653.91
julia> data = XLSX.readxlsx("data.xlsx")["Sheet1"][:]9×4 Matrix{Any}:
 "Country"      "Year"    "forarea"      "forvol"
 "Germany"  2000        11.354       3381
 "France"   2000        15.288       2254.28
 "Italy"    2000         8.36925     1058.71
 "Sweden"   2000        28.163       3184.67
 "Germany"  2020        11.419       3663
 "France"   2020        17.253       3055.83
 "Italy"    2020         9.56613     1424.4
 "Sweden"   2020        27.98        3653.91
julia> data = XLSX.readdata("data.xlsx", "Sheet1", "A1:D9")9×4 Matrix{Any}:
 "Country"      "Year"    "forarea"      "forvol"
 "Germany"  2000        11.354       3381
 "France"   2000        15.288       2254.28
 "Italy"    2000         8.36925     1058.71
 "Sweden"   2000        28.163       3184.67
 "Germany"  2020        11.419       3663
 "France"   2020        17.253       3055.83
 "Italy"    2020         9.56613     1424.4
 "Sweden"   2020        27.98        3653.91
julia> XLSX.readtable("data.xlsx", "Sheet1") # a specific XLSX data structure usable as DF constructorXLSX.DataTable(Any[Any["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"], Any[2000, 2000, 2000, 2000, 2020, 2020, 2020, 2020], Any[11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98], Any[3381, 2254.28, 1058.71, 3184.67, 3663, 3055.83, 1424.4, 3653.91]], [:Country, :Year, :forarea, :forvol], Dict(:Country => 1, :forvol => 4, :Year => 2, :forarea => 3))
julia> data = DataFrame(XLSX.readtable("data.xlsx", "Sheet1"))8×4 DataFrame
 Row │ Country  Year  forarea  forvol
     │ Any      Any   Any      Any
─────┼─────────────────────────────────
   1 │ Germany  2000  11.354   3381
   2 │ France   2000  15.288   2254.28
   3 │ Italy    2000  8.36925  1058.71
   4 │ Sweden   2000  28.163   3184.67
   5 │ Germany  2020  11.419   3663
   6 │ France   2020  17.253   3055.83
   7 │ Italy    2020  9.56613  1424.4
   8 │ Sweden   2020  27.98    3653.91

OdsIO.jl: ods -> {Matrix, DataFrame}

julia> using OdsIO
julia> ods_read("data.ods";sheetName="Sheet1",retType="DataFrame")8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
julia> ods_read("data.ods";sheetName="Sheet1",retType="Matrix",range=[(2,3),(9,4)]) # [(tlr,tlc),(brr,brc)]8×2 Matrix{Any}:
 11.354    3381
 15.288    2254.28
  8.36925  1058.71
 28.163    3184.67
 11.419    3663
 17.253    3055.83
  9.56613  1424.4
 27.98     3653.91

HTTP.jl: from internet

julia> import HTTP, Pipe.@pipe
julia> using ZipFile, Tar
julia> urlData  = "https://github.com/sylvaticus/IntroSPMLJuliaCourse/raw/main/lessonsSources/02_-_JULIA2_-_Scientific_programming_with_Julia/data.csv""https://github.com/sylvaticus/IntroSPMLJuliaCourse/raw/main/lessonsSources/02_-_JULIA2_-_Scientific_programming_with_Julia/data.csv"
julia> urlDataZ = "https://github.com/sylvaticus/IntroSPMLJuliaCourse/raw/main/lessonsSources/02_-_JULIA2_-_Scientific_programming_with_Julia/data.zip""https://github.com/sylvaticus/IntroSPMLJuliaCourse/raw/main/lessonsSources/02_-_JULIA2_-_Scientific_programming_with_Julia/data.zip"
julia> data = @pipe HTTP.get(urlData).body        |>
                    replace!(_, UInt8(';') => UInt8(' ')) |>  # if we need to do modifications to the file before importing
                    CSV.File(_, delim=' ')                |>
                    DataFrame;

ZipFile.jl : from a single zipped csv file..

...on disk...

julia> data = @pipe ZipFile.Reader("data.zip").files[1] |>
                    CSV.File(read(_), delim=';') |>
                    DataFrame8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String7  Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91

...or on internet:

julia> data = @pipe HTTP.get(urlDataZ).body       |>
                    IOBuffer(_)                      |>
                    ZipFile.Reader(_).files[1]    |>
                    CSV.File(read(_), delim=';')  |>
                    DataFrame8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String7  Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91

DataFrame constructor

julia> # named tuple of (colName,colData)..
       data = DataFrame(
           country = ["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"],
           year    = [2000,2000,2000,2000,2020,2020,2020,2020],
           forarea = [11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98],
           forvol  = [3381, 2254.28, 1058.71, 3184.67, 3663, 3055.83, 1424.4, 3653.91]
       )8×4 DataFrame
 Row │ country  year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91

Matrix -> DataFrame

Headers and data separated:

julia> M = ["Germany"	2000	11.354	3381
            "France"	2000	15.288	2254.28
            "Italy"	2000	8.36925	1058.71
            "Sweden"	2000	28.163	3184.67
            "Germany"	2020	11.419	3663
            "France"	2020	17.253	3055.83
            "Italy"	2020	9.56613	1424.4
            "Sweden"	2020	27.98	3653.91]8×4 Matrix{Any}:
 "Germany"  2000  11.354    3381
 "France"   2000  15.288    2254.28
 "Italy"    2000   8.36925  1058.71
 "Sweden"   2000  28.163    3184.67
 "Germany"  2020  11.419    3663
 "France"   2020  17.253    3055.83
 "Italy"    2020   9.56613  1424.4
 "Sweden"   2020  27.98     3653.91
julia> headers = ["Country", "Year", "forarea", "forvol"]
       # array of colData arrays, array of headers4-element Vector{String}:
 "Country"
 "Year"
 "forarea"
 "forvol"
julia> data = DataFrame([[M[:,i]...] for i in 1:size(M,2)], Symbol.(headers))8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91

Headers on the first row of the matrix:

julia> M = ["Country"	"Year"	"forarea" "forvol"
            "Germany"	2000	11.354	  3381
            "France"	2000	15.288	  2254.28
            "Italy"	2000	8.36925	  1058.71
            "Sweden"	2000	28.163	  3184.67
            "Germany"	2020	11.419	  3663
            "France"	2020	17.253	  3055.83
            "Italy"	2020	9.56613	  1424.4
            "Sweden"	2020	27.98	  3653.91]9×4 Matrix{Any}:
 "Country"      "Year"    "forarea"      "forvol"
 "Germany"  2000        11.354       3381
 "France"   2000        15.288       2254.28
 "Italy"    2000         8.36925     1058.71
 "Sweden"   2000        28.163       3184.67
 "Germany"  2020        11.419       3663
 "France"   2020        17.253       3055.83
 "Italy"    2020         9.56613     1424.4
 "Sweden"   2020        27.98        3653.91
julia> data = DataFrame([[M[2:end,i]...] for i in 1:size(M,2)], Symbol.(M[1,:])) # note the autorecognision of col types8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91

Getting insights on the data

In VSCode, we can also use the workpanel for a nice sortable tabular view of a df

julia> show(data,allrows=true,allcols=true)8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
julia> first(data,6)6×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
julia> last(data, 6)6×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Italy     2000   8.36925  1058.71
   2 │ Sweden    2000  28.163    3184.67
   3 │ Germany   2020  11.419    3663.0
   4 │ France    2020  17.253    3055.83
   5 │ Italy     2020   9.56613  1424.4
   6 │ Sweden    2020  27.98     3653.91
julia> describe(data)4×7 DataFrame
 Row │ variable  mean     min      median   max     nmissing  eltype
     │ Symbol    Union…   Any      Union…   Any     Int64     DataType
─────┼─────────────────────────────────────────────────────────────────
   1 │ Country            France            Sweden         0  String
   2 │ Year      2010.0   2000     2010.0   2020           0  Int64
   3 │ forarea   16.174   8.36925  13.3535  28.163         0  Float64
   4 │ forvol    2709.47  1058.71  3120.25  3663.0         0  Float64
julia> nR,nC = size(data)(8, 4)
julia> names(data)4-element Vector{String}:
 "Country"
 "Year"
 "forarea"
 "forvol"
julia> for r in eachrow(data)
           println(r) # note is it a "DataFrameRow"
       endDataFrameRow
 Row │ Country  Year   forarea  forvol
     │ String   Int64  Float64  Float64
─────┼──────────────────────────────────
   1 │ Germany   2000   11.354   3381.0
DataFrameRow
 Row │ Country  Year   forarea  forvol
     │ String   Int64  Float64  Float64
─────┼──────────────────────────────────
   2 │ France    2000   15.288  2254.28
DataFrameRow
 Row │ Country  Year   forarea  forvol
     │ String   Int64  Float64  Float64
─────┼──────────────────────────────────
   3 │ Italy     2000  8.36925  1058.71
DataFrameRow
 Row │ Country  Year   forarea  forvol
     │ String   Int64  Float64  Float64
─────┼──────────────────────────────────
   4 │ Sweden    2000   28.163  3184.67
DataFrameRow
 Row │ Country  Year   forarea  forvol
     │ String   Int64  Float64  Float64
─────┼──────────────────────────────────
   5 │ Germany   2020   11.419   3663.0
DataFrameRow
 Row │ Country  Year   forarea  forvol
     │ String   Int64  Float64  Float64
─────┼──────────────────────────────────
   6 │ France    2020   17.253  3055.83
DataFrameRow
 Row │ Country  Year   forarea  forvol
     │ String   Int64  Float64  Float64
─────┼──────────────────────────────────
   7 │ Italy     2020  9.56613   1424.4
DataFrameRow
 Row │ Country  Year   forarea  forvol
     │ String   Int64  Float64  Float64
─────┼──────────────────────────────────
   8 │ Sweden    2020    27.98  3653.91
julia> for c in eachcol(data)
           println(c) # an array
           println(nonmissingtype(eltype(c)))
       end["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"]
String
[2000, 2000, 2000, 2000, 2020, 2020, 2020, 2020]
Int64
[11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98]
Float64
[3381.0, 2254.28, 1058.71, 3184.67, 3663.0, 3055.83, 1424.4, 3653.91]
Float64

Selection and querying data

Column(s) selection

In general we can select: (a) by name (strings or symbols) or by position; (b) a single or a vector of columns, (c) copying or making a view (a reference without copying) of the underlying data

julia> data[:,["Country","Year"]] # copy, using strings8×2 DataFrame
 Row │ Country  Year
     │ String   Int64
─────┼────────────────
   1 │ Germany   2000
   2 │ France    2000
   3 │ Italy     2000
   4 │ Sweden    2000
   5 │ Germany   2020
   6 │ France    2020
   7 │ Italy     2020
   8 │ Sweden    2020
julia> data[!,[:Country,:Year]]   # view, using symbols8×2 DataFrame
 Row │ Country  Year
     │ String   Int64
─────┼────────────────
   1 │ Germany   2000
   2 │ France    2000
   3 │ Italy     2000
   4 │ Sweden    2000
   5 │ Germany   2020
   6 │ France    2020
   7 │ Italy     2020
   8 │ Sweden    2020
julia> data.Year                  # equiv. to `data[!,Year]`8-element Vector{Int64}:
 2000
 2000
 2000
 2000
 2020
 2020
 2020
 2020
julia> data[:,1]                  # also `data[!,1]`8-element Vector{String}:
 "Germany"
 "France"
 "Italy"
 "Sweden"
 "Germany"
 "France"
 "Italy"
 "Sweden"
julia> data[:,Not(["Year"])]8×3 DataFrame
 Row │ Country  forarea   forvol
     │ String   Float64   Float64
─────┼────────────────────────────
   1 │ Germany  11.354    3381.0
   2 │ France   15.288    2254.28
   3 │ Italy     8.36925  1058.71
   4 │ Sweden   28.163    3184.67
   5 │ Germany  11.419    3663.0
   6 │ France   17.253    3055.83
   7 │ Italy     9.56613  1424.4
   8 │ Sweden   27.98     3653.91

Row(s) selection

julia> data[1,:]    # DataFrameRowDataFrameRow
 Row │ Country  Year   forarea  forvol
     │ String   Int64  Float64  Float64
─────┼──────────────────────────────────
   1 │ Germany   2000   11.354   3381.0
julia> data[1:3,:]  # DataFrame3×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71

Note rows have no title names as colums do.

Cell(s) selection

julia> data[2,[2,4]]DataFrameRow
 Row │ Year   forvol
     │ Int64  Float64
─────┼────────────────
   2 │  2000  2254.28
julia> data[2,"forarea"]15.288

Note that the returned selection is:

an Array{T,1} if it is a single column;
a DataFrameRow (similar in behaviour to a DataFrame) if a single row;
T if a single cell;
an other DataFrame otherwise.

Boolean selection

Both rows and column of a DataFrame (but also of an Matrix) can be selected by passing an array of booleans as column or row mask (and, only for Matrices, also a matrix of booleans)

julia> mask = [false, false, false, false, true, true, true, true]8-element Vector{Bool}:
 0
 0
 0
 0
 1
 1
 1
 1
julia> data[mask,:]4×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2020  11.419    3663.0
   2 │ France    2020  17.253    3055.83
   3 │ Italy     2020   9.56613  1424.4
   4 │ Sweden    2020  27.98     3653.91
julia> mask = fill(false,nR,nC)8×4 Matrix{Bool}:
 0  0  0  0
 0  0  0  0
 0  0  0  0
 0  0  0  0
 0  0  0  0
 0  0  0  0
 0  0  0  0
 0  0  0  0
julia> mask[2:end,2:3] .= true7×2 view(::Matrix{Bool}, 2:8, 2:3) with eltype Bool:
 1  1
 1  1
 1  1
 1  1
 1  1
 1  1
 1  1
julia> mask
       # data[mask] # error !8×4 Matrix{Bool}:
 0  0  0  0
 0  1  1  0
 0  1  1  0
 0  1  1  0
 0  1  1  0
 0  1  1  0
 0  1  1  0
 0  1  1  0
julia> Matrix(data)[mask]14-element Vector{Any}:
 2000
 2000
 2000
 2020
 2020
 2020
 2020
   15.288
    8.36925
   28.163
   11.419
   17.253
    9.56613
   27.98

Boolean selection can be used to filter on conditions, e.g.:

julia> data[data.Year .>= 2020,:]4×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2020  11.419    3663.0
   2 │ France    2020  17.253    3055.83
   3 │ Italy     2020   9.56613  1424.4
   4 │ Sweden    2020  27.98     3653.91
julia> data[[i in ["France", "Italy"] for i in data.Country] .&& (data.Year .== 2000),:] # note the parhenthesis2×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ France    2000  15.288    2254.28
   2 │ Italy     2000   8.36925  1058.71

Filtering using the @subset macro from the `DataFramesMacro` package

julia> using DataFramesMeta
julia> @subset(data, :Year .> 2010 )4×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2020  11.419    3663.0
   2 │ France    2020  17.253    3055.83
   3 │ Italy     2020   9.56613  1424.4
   4 │ Sweden    2020  27.98     3653.91
julia> colToFilter = :Country:Country
julia> @subset(data, :Year .> 2010, cols(colToFilter) .== "France" ) # Conditions are "end" by default. If the column name is embedded in a varaible we eed to use `cols(varname)`1×4 DataFrame
 Row │ Country  Year   forarea  forvol
     │ String   Int64  Float64  Float64
─────┼──────────────────────────────────
   1 │ France    2020   17.253  3055.83

Filtering using the `Query` package

julia> using Query
julia> dfOut = @from i in data begin # `i` is a single row
           @where i.Country == "France" .&& i.Year >= 2000
           # Select a group of columns, eventually changing their name:
           @select {i.Year, FranceForArea=i.forarea} # or just `i` for the whole row
           @collect DataFrame
       end2×2 DataFrame
 Row │ Year   FranceForArea
     │ Int64  Float64
─────┼──────────────────────
   1 │  2000         15.288
   2 │  2020         17.253

long but flexible

Managing missing values

Tip

See also the section Missingness implementations for a general discussion on missing values. BetaML has now several algorithms for missing imputation.

julia> df = copy(data)
       # df[3,"forarea"]  = missing # Error, type is Flat64, not Union{Float64,Missing}8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
julia> df.forarea = allowmissing(df.forarea) # also disallowmissing8-element Vector{Union{Missing, Float64}}:
 11.354
 15.288
  8.36925
 28.163
 11.419
 17.253
  9.56613
 27.98
julia> allowmissing!(df)8×4 DataFrame
 Row │ Country  Year    forarea   forvol
     │ String?  Int64?  Float64?  Float64?
─────┼─────────────────────────────────────
   1 │ Germany    2000  11.354     3381.0
   2 │ France     2000  15.288     2254.28
   3 │ Italy      2000   8.36925   1058.71
   4 │ Sweden     2000  28.163     3184.67
   5 │ Germany    2020  11.419     3663.0
   6 │ France     2020  17.253     3055.83
   7 │ Italy      2020   9.56613   1424.4
   8 │ Sweden     2020  27.98      3653.91
julia> df[3,"forarea"]  = missingmissing
julia> df[6,"forarea"]  = missingmissing
julia> df[6,"Country"]  = missingmissing
julia> nMissings        = length(findall(x -> ismissing(x), df.forarea)) # Count `missing` values in a column.2
julia> dropmissing(df)6×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Sweden    2000  28.163    3184.67
   4 │ Germany   2020  11.419    3663.0
   5 │ Italy     2020   9.56613  1424.4
   6 │ Sweden    2020  27.98     3653.91
julia> dropmissing(df[:,["forarea","forvol"]])6×2 DataFrame
 Row │ forarea   forvol
     │ Float64   Float64
─────┼───────────────────
   1 │ 11.354    3381.0
   2 │ 15.288    2254.28
   3 │ 28.163    3184.67
   4 │ 11.419    3663.0
   5 │  9.56613  1424.4
   6 │ 27.98     3653.91
julia> collect(skipmissing(df.forarea))6-element Vector{Float64}:
 11.354
 15.288
 28.163
 11.419
  9.56613
 27.98
julia> completecases(df)8-element BitVector:
 1
 1
 0
 1
 1
 0
 1
 1
julia> completecases(df[!,["forarea","forvol"]])8-element BitVector:
 1
 1
 0
 1
 1
 0
 1
 1
julia> [df[ismissing.(df[!,col]), col] .= 0 for col in names(df) if nonmissingtype(eltype(df[!,col])) <: Number] # Replace `missing` with `0` values in all numeric columns, like `Float64` and `Int64`;3-element Vector{SubArray{T, 1, P, Tuple{Vector{Int64}}, false} where {T, P}}:
 0-element view(::Vector{Union{Missing, Int64}}, Int64[]) with eltype Union{Missing, Int64}
 Union{Missing, Float64}[0.0, 0.0]
 0-element view(::Vector{Union{Missing, Float64}}, Int64[]) with eltype Union{Missing, Float64}
julia> [df[ismissing.(df[!,col]), col] .= "" for col in names(df) if nonmissingtype(eltype(df[!,col])) <: AbstractString] # Replace `missing` with `""` values in all string columns;1-element Vector{SubArray{Union{Missing, String}, 1, Vector{Union{Missing, String}}, Tuple{Vector{Int64}}, false}}:
 [""]
julia> df8×4 DataFrame
 Row │ Country  Year    forarea   forvol
     │ String?  Int64?  Float64?  Float64?
─────┼─────────────────────────────────────
   1 │ Germany    2000  11.354     3381.0
   2 │ France     2000  15.288     2254.28
   3 │ Italy      2000   0.0       1058.71
   4 │ Sweden     2000  28.163     3184.67
   5 │ Germany    2020  11.419     3663.0
   6 │            2020   0.0       3055.83
   7 │ Italy      2020   9.56613   1424.4
   8 │ Sweden     2020  27.98      3653.91

Editing data

julia> df = copy(data)8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
julia> df[1,"forarea"] = 11.311.3
julia> df[[2,4],"forarea"] .= 102-element view(::Vector{Float64}, [2, 4]) with eltype Float64:
 10.0
 10.0
julia> df8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.3      3381.0
   2 │ France    2000  10.0      2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  10.0      3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
julia> push!(df,["UK",2020,5.0,800.0]) # add row9×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.3      3381.0
   2 │ France    2000  10.0      2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  10.0      3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
   9 │ UK        2020   5.0       800.0
julia> sort!(df,["Country","Year"], rev=false)9×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ France    2000  10.0      2254.28
   2 │ France    2020  17.253    3055.83
   3 │ Germany   2000  11.3      3381.0
   4 │ Germany   2020  11.419    3663.0
   5 │ Italy     2000   8.36925  1058.71
   6 │ Italy     2020   9.56613  1424.4
   7 │ Sweden    2000  10.0      3184.67
   8 │ Sweden    2020  27.98     3653.91
   9 │ UK        2020   5.0       800.0
julia> df2 = similar(df) # rubish inside9×4 DataFrame
 Row │ Country  Year             forarea       forvol
     │ String   Int64            Float64       Float64
─────┼──────────────────────────────────────────────────
   1 │ #undef   140147749335232  6.92422e-310  5.0e-324
   2 │ #undef   140147749335168  6.92422e-310  5.0e-324
   3 │ #undef   140147749343856  6.92422e-310  5.0e-324
   4 │ #undef   140147749343856  6.92422e-310  0.0
   5 │ #undef   140146619295008  6.92422e-310  0.0
   6 │ #undef   140146619295008  6.92422e-310  5.0e-324
   7 │ #undef   140146619295904  6.92422e-310  5.0e-324
   8 │ #undef   140147749343856  6.92422e-310  5.0e-324
   9 │ #undef   140147676691024  6.92423e-310  0.0
julia> df = similar(df,0) # empty a dataframe. The second parameter is the number of rows desired0×4 DataFrame
 Row │ Country  Year   forarea  forvol
     │ String   Int64  Float64  Float64
─────┴──────────────────────────────────

Work on dataframe structure

julia> df       = copy(data)8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
julia> df.foo   = [1,2,3,4,5,6,7,8] # existing or new column8-element Vector{Int64}:
 1
 2
 3
 4
 5
 6
 7
 8
julia> df.volHa = data.forvol ./ data.forarea8-element Vector{Float64}:
 297.78051787916155
 147.45421245421247
 126.49998506437257
 113.07992756453503
 320.78115421665643
 177.1187619544427
 148.9003390085646
 130.59006433166547
julia> df.goo   = Array{Union{Missing,Float64},1}(missing,size(data,1))8-element Vector{Union{Missing, Float64}}:
 missing
 missing
 missing
 missing
 missing
 missing
 missing
 missing
julia> select!(df,Not(["foo","volHa","goo"])) # remove  cols by name8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
julia> rename!(df, ["Country2", "Year2", "forarea2", "forvol2"])8×4 DataFrame
 Row │ Country2  Year2  forarea2  forvol2
     │ String    Int64  Float64   Float64
─────┼────────────────────────────────────
   1 │ Germany    2000  11.354    3381.0
   2 │ France     2000  15.288    2254.28
   3 │ Italy      2000   8.36925  1058.71
   4 │ Sweden     2000  28.163    3184.67
   5 │ Germany    2020  11.419    3663.0
   6 │ France     2020  17.253    3055.83
   7 │ Italy      2020   9.56613  1424.4
   8 │ Sweden     2020  27.98     3653.91
julia> rename!(df, Dict("Country2" => "Country"))8×4 DataFrame
 Row │ Country  Year2  forarea2  forvol2
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
julia> df       = df[:,["Year2", "Country","forarea2","forvol2"] ] #  change column order8×4 DataFrame
 Row │ Year2  Country  forarea2  forvol2
     │ Int64  String   Float64   Float64
─────┼───────────────────────────────────
   1 │  2000  Germany  11.354    3381.0
   2 │  2000  France   15.288    2254.28
   3 │  2000  Italy     8.36925  1058.71
   4 │  2000  Sweden   28.163    3184.67
   5 │  2020  Germany  11.419    3663.0
   6 │  2020  France   17.253    3055.83
   7 │  2020  Italy     9.56613  1424.4
   8 │  2020  Sweden   27.98     3653.91
julia> insertcols!(df, 2, :foo => [1,2,3,4,5,6,7,8] ) # insert a column at position 28×5 DataFrame
 Row │ Year2  foo    Country  forarea2  forvol2
     │ Int64  Int64  String   Float64   Float64
─────┼──────────────────────────────────────────
   1 │  2000      1  Germany  11.354    3381.0
   2 │  2000      2  France   15.288    2254.28
   3 │  2000      3  Italy     8.36925  1058.71
   4 │  2000      4  Sweden   28.163    3184.67
   5 │  2020      5  Germany  11.419    3663.0
   6 │  2020      6  France   17.253    3055.83
   7 │  2020      7  Italy     9.56613  1424.4
   8 │  2020      8  Sweden   27.98     3653.91
julia> df.namedYear = map(string,df.Year2)8-element Vector{String}:
 "2000"
 "2000"
 "2000"
 "2000"
 "2020"
 "2020"
 "2020"
 "2020"
julia> stringToInt(str) = try parse(Int64, str) catch; return(missing) end; df.Year3 = map(stringToInt, df.namedYear)8-element Vector{Int64}:
 2000
 2000
 2000
 2000
 2020
 2020
 2020
 2020
julia> df2 = hcat(df,data,makeunique=true)8×11 DataFrame
 Row │ Year2  foo    Country  forarea2  forvol2  namedYear  Year3  Country_1   ⋯
     │ Int64  Int64  String   Float64   Float64  String     Int64  String      ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │  2000      1  Germany  11.354    3381.0   2000        2000  Germany     ⋯
   2 │  2000      2  France   15.288    2254.28  2000        2000  France
   3 │  2000      3  Italy     8.36925  1058.71  2000        2000  Italy
   4 │  2000      4  Sweden   28.163    3184.67  2000        2000  Sweden
   5 │  2020      5  Germany  11.419    3663.0   2020        2020  Germany     ⋯
   6 │  2020      6  France   17.253    3055.83  2020        2020  France
   7 │  2020      7  Italy     9.56613  1424.4   2020        2020  Italy
   8 │  2020      8  Sweden   27.98     3653.91  2020        2020  Sweden
                                                               3 columns omitted
julia> df3 = copy(data)8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
julia> df4 = vcat(data,df3)16×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
   9 │ Germany   2000  11.354    3381.0
  10 │ France    2000  15.288    2254.28
  11 │ Italy     2000   8.36925  1058.71
  12 │ Sweden    2000  28.163    3184.67
  13 │ Germany   2020  11.419    3663.0
  14 │ France    2020  17.253    3055.83
  15 │ Italy     2020   9.56613  1424.4
  16 │ Sweden    2020  27.98     3653.91

Categorical data

julia> using CategoricalArrays # You may want to consider also PooledArrays
julia> df.Year2 = categorical(df.Year2)8-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 2000
 2000
 2000
 2000
 2020
 2020
 2020
 2020
julia> transform!(df, names(df, AbstractString) .=> categorical, renamecols=false) # transform to categorical all string columns8×7 DataFrame
 Row │ Year2  foo    Country  forarea2  forvol2  namedYear  Year3
     │ Cat…   Int64  Cat…     Float64   Float64  Cat…       Int64
─────┼────────────────────────────────────────────────────────────
   1 │ 2000       1  Germany  11.354    3381.0   2000        2000
   2 │ 2000       2  France   15.288    2254.28  2000        2000
   3 │ 2000       3  Italy     8.36925  1058.71  2000        2000
   4 │ 2000       4  Sweden   28.163    3184.67  2000        2000
   5 │ 2020       5  Germany  11.419    3663.0   2020        2020
   6 │ 2020       6  France   17.253    3055.83  2020        2020
   7 │ 2020       7  Italy     9.56613  1424.4   2020        2020
   8 │ 2020       8  Sweden   27.98     3653.91  2020        2020

Warning

Attention that while the memory to store the data decreases, and grouping is way more efficient, filtering with categorical values is not necessarily quicker (indeed it can be a bit slower)

julia> levels(df.Year2)2-element Vector{Int64}:
 2000
 2020
julia> levels!(df.Country,["Sweden","Germany","France","Italy"]) # Let you define a personalised order, useful for ordered data8-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "Germany"
 "France"
 "Italy"
 "Sweden"
 "Germany"
 "France"
 "Italy"
 "Sweden"
julia> sort(df.Country)8-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "Sweden"
 "Sweden"
 "Germany"
 "Germany"
 "France"
 "France"
 "Italy"
 "Italy"
julia> sort!(df,"Country")8×7 DataFrame
 Row │ Year2  foo    Country  forarea2  forvol2  namedYear  Year3
     │ Cat…   Int64  Cat…     Float64   Float64  Cat…       Int64
─────┼────────────────────────────────────────────────────────────
   1 │ 2000       4  Sweden   28.163    3184.67  2000        2000
   2 │ 2020       8  Sweden   27.98     3653.91  2020        2020
   3 │ 2000       1  Germany  11.354    3381.0   2000        2000
   4 │ 2020       5  Germany  11.419    3663.0   2020        2020
   5 │ 2000       2  France   15.288    2254.28  2000        2000
   6 │ 2020       6  France   17.253    3055.83  2020        2020
   7 │ 2000       3  Italy     8.36925  1058.71  2000        2000
   8 │ 2020       7  Italy     9.56613  1424.4   2020        2020
julia> df.Years2 = unwrap.(df.Year2) # convert a categorical array into a normal one.8-element Vector{Int64}:
 2000
 2020
 2000
 2020
 2000
 2020
 2000
 2020

Joining dataframes

julia> df1,df2 = copy(data),copy(data)(8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91, 8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91)
julia> push!(df1,["US",2020,5.0,1000.0])9×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
   9 │ US        2020   5.0      1000.0
julia> push!(df2,["China",2020,50.0,1000.0])9×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
   9 │ China     2020  50.0      1000.0
julia> rename!(df2,"Year"=>"year")9×4 DataFrame
 Row │ Country  year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
   9 │ China     2020  50.0      1000.0
julia> innerjoin(df1,df2,on=["Country","Year"=>"year"],makeunique=true) # common records only8×6 DataFrame
 Row │ Country  Year   forarea   forvol   forarea_1  forvol_1
     │ String   Int64  Float64   Float64  Float64    Float64
─────┼────────────────────────────────────────────────────────
   1 │ Germany   2000  11.354    3381.0    11.354     3381.0
   2 │ France    2000  15.288    2254.28   15.288     2254.28
   3 │ Italy     2000   8.36925  1058.71    8.36925   1058.71
   4 │ Sweden    2000  28.163    3184.67   28.163     3184.67
   5 │ Germany   2020  11.419    3663.0    11.419     3663.0
   6 │ France    2020  17.253    3055.83   17.253     3055.83
   7 │ Italy     2020   9.56613  1424.4     9.56613   1424.4
   8 │ Sweden    2020  27.98     3653.91   27.98      3653.91

Also available: leftjoin (all records on left df), rightjoin (all on right df), outerjoin(all records returned), semijoin (like inner join by only with columns from the right df), antijoin (left not on right df) and crossjoin (like the cartesian product, each on the right by each on the left)

Pivoting data

long and wide are two kind of layout for equivalent representation of tabled data. In a long layout we have each row being an observation of a single variable, and each record is represented as dim1, dim2, ..., value. As the name implies, "long" layouts tend to be relativelly long and hard to analyse by an human, but are very easity to handle. At the opposite, A wide layout represents multiple observations on the same row, eventually using multiple horizontal axis as in the next figure (but Julia dataframes handle only a single horizzontal axis):


	Forest Area		Forest Volumes
	2000	2020	2000	2020
Germany	11.35	11.42	3381	3663
France	15.29	17.25	2254	3056

_wide layout is easier to visually analise for a human mind but much more hard to analyse in a sistermatic way. We will learn now how to move from one type of layout to the other.

Stacking columns: from wide to long

julia> longDf  = stack(data,["forarea","forvol"])    # we specify the variable to stack (the "measured" variables)16×4 DataFrame
 Row │ Country  Year   variable  value
     │ String   Int64  String    Float64
─────┼──────────────────────────────────────
   1 │ Germany   2000  forarea     11.354
   2 │ France    2000  forarea     15.288
   3 │ Italy     2000  forarea      8.36925
   4 │ Sweden    2000  forarea     28.163
   5 │ Germany   2020  forarea     11.419
   6 │ France    2020  forarea     17.253
   7 │ Italy     2020  forarea      9.56613
   8 │ Sweden    2020  forarea     27.98
   9 │ Germany   2000  forvol    3381.0
  10 │ France    2000  forvol    2254.28
  11 │ Italy     2000  forvol    1058.71
  12 │ Sweden    2000  forvol    3184.67
  13 │ Germany   2020  forvol    3663.0
  14 │ France    2020  forvol    3055.83
  15 │ Italy     2020  forvol    1424.4
  16 │ Sweden    2020  forvol    3653.91
julia> longDf2 = stack(data,Not(["Country","Year"])) # we specify the variables _not_ to stack (the id variables)16×4 DataFrame
 Row │ Country  Year   variable  value
     │ String   Int64  String    Float64
─────┼──────────────────────────────────────
   1 │ Germany   2000  forarea     11.354
   2 │ France    2000  forarea     15.288
   3 │ Italy     2000  forarea      8.36925
   4 │ Sweden    2000  forarea     28.163
   5 │ Germany   2020  forarea     11.419
   6 │ France    2020  forarea     17.253
   7 │ Italy     2020  forarea      9.56613
   8 │ Sweden    2020  forarea     27.98
   9 │ Germany   2000  forvol    3381.0
  10 │ France    2000  forvol    2254.28
  11 │ Italy     2000  forvol    1058.71
  12 │ Sweden    2000  forvol    3184.67
  13 │ Germany   2020  forvol    3663.0
  14 │ France    2020  forvol    3055.83
  15 │ Italy     2020  forvol    1424.4
  16 │ Sweden    2020  forvol    3653.91
julia> longDf3 = stack(data)                         # automatically stack all numerical variables16×4 DataFrame
 Row │ Country  Year   variable  value
     │ String   Int64  String    Float64
─────┼──────────────────────────────────────
   1 │ Germany   2000  forarea     11.354
   2 │ France    2000  forarea     15.288
   3 │ Italy     2000  forarea      8.36925
   4 │ Sweden    2000  forarea     28.163
   5 │ Germany   2020  forarea     11.419
   6 │ France    2020  forarea     17.253
   7 │ Italy     2020  forarea      9.56613
   8 │ Sweden    2020  forarea     27.98
   9 │ Germany   2000  forvol    3381.0
  10 │ France    2000  forvol    2254.28
  11 │ Italy     2000  forvol    1058.71
  12 │ Sweden    2000  forvol    3184.67
  13 │ Germany   2020  forvol    3663.0
  14 │ France    2020  forvol    3055.83
  15 │ Italy     2020  forvol    1424.4
  16 │ Sweden    2020  forvol    3653.91
julia> longDf == longDf2 == longDf3true

Note how the columns variable and value have been added automatically to host the stachked data

Unstacking columns: from wide to long

julia> wideDf = unstack(longDf,["Country","Year"],"variable","value") # args: df, [cols to remains cols also in the wide layout], column with the ids to expand horizontally and column with the relative values8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64?  Float64?
─────┼────────────────────────────────────
   1 │ Germany   2000  11.354     3381.0
   2 │ France    2000  15.288     2254.28
   3 │ Italy     2000   8.36925   1058.71
   4 │ Sweden    2000  28.163     3184.67
   5 │ Germany   2020  11.419     3663.0
   6 │ France    2020  17.253     3055.83
   7 │ Italy     2020   9.56613   1424.4
   8 │ Sweden    2020  27.98      3653.91
julia> wideDf2 = unstack(longDf,"variable","value") # cols to remains cols also in the wide layout omitted: all cols not to expand and relative value col remains as col8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64?  Float64?
─────┼────────────────────────────────────
   1 │ Germany   2000  11.354     3381.0
   2 │ France    2000  15.288     2254.28
   3 │ Italy     2000   8.36925   1058.71
   4 │ Sweden    2000  28.163     3184.67
   5 │ Germany   2020  11.419     3663.0
   6 │ France    2020  17.253     3055.83
   7 │ Italy     2020   9.56613   1424.4
   8 │ Sweden    2020  27.98      3653.91
julia> wideDf == wideDf2 == datatrue

While the DataFrames package doesn't support multiple axis we can still arrive to the table below with a little bit of work by unstacking different columns in separate wide dataframes and then joining or horizontally concatenating them:

julia> wideArea = unstack(data,"Country","Year","forarea")4×3 DataFrame
 Row │ Country  2000      2020
     │ String   Float64?  Float64?
─────┼─────────────────────────────
   1 │ Germany  11.354    11.419
   2 │ France   15.288    17.253
   3 │ Italy     8.36925   9.56613
   4 │ Sweden   28.163    27.98
julia> wideVols = unstack(data,"Country","Year","forvol")4×3 DataFrame
 Row │ Country  2000      2020
     │ String   Float64?  Float64?
─────┼─────────────────────────────
   1 │ Germany   3381.0    3663.0
   2 │ France    2254.28   3055.83
   3 │ Italy     1058.71   1424.4
   4 │ Sweden    3184.67   3653.91
julia> rename!(wideArea,["Country","area_2000","area_2020"])4×3 DataFrame
 Row │ Country  area_2000  area_2020
     │ String   Float64?   Float64?
─────┼───────────────────────────────
   1 │ Germany   11.354     11.419
   2 │ France    15.288     17.253
   3 │ Italy      8.36925    9.56613
   4 │ Sweden    28.163     27.98
julia> rename!(wideVols,["Country","vol_2000","vol_2020"])4×3 DataFrame
 Row │ Country  vol_2000  vol_2020
     │ String   Float64?  Float64?
─────┼─────────────────────────────
   1 │ Germany   3381.0    3663.0
   2 │ France    2254.28   3055.83
   3 │ Italy     1058.71   1424.4
   4 │ Sweden    3184.67   3653.91
julia> wideWideDf = outerjoin(wideArea,wideVols,on="Country")4×5 DataFrame
 Row │ Country  area_2000  area_2020  vol_2000  vol_2020
     │ String   Float64?   Float64?   Float64?  Float64?
─────┼───────────────────────────────────────────────────
   1 │ Germany   11.354     11.419     3381.0    3663.0
   2 │ France    15.288     17.253     2254.28   3055.83
   3 │ Italy      8.36925    9.56613   1058.71   1424.4
   4 │ Sweden    28.163     27.98      3184.67   3653.91

The Split-Apply-Combine strategy

Aka "divide and conquer". Rather than try to modify the dataset direclty, we first split it in subparts, we work on each subpart and then we recombine them in a target dataset

julia> using Statistics # for `mean`
julia> groupby(data,["Country","Year"]) # The "split" partGroupedDataFrame with 8 groups based on keys: Country, Year
First Group (1 row): Country = "Germany", Year = 2000
 Row │ Country  Year   forarea  forvol
     │ String   Int64  Float64  Float64
─────┼──────────────────────────────────
   1 │ Germany   2000   11.354   3381.0
⋮
Last Group (1 row): Country = "Sweden", Year = 2020
 Row │ Country  Year   forarea  forvol
     │ String   Int64  Float64  Float64
─────┼──────────────────────────────────
   1 │ Sweden    2020    27.98  3653.91

Aggregation:

julia> combine(groupby(data,["Year"]) ,  "forarea" => sum => "sum_area", "forvol" => sum => "sum_vol", nrow)2×4 DataFrame
 Row │ Year   sum_area  sum_vol   nrow
     │ Int64  Float64   Float64   Int64
─────┼──────────────────────────────────
   1 │  2000   63.1743   9878.66      4
   2 │  2020   66.2181  11797.1       4

...or...

julia> combine(groupby(data,["Year"])) do subdf # slower
           (sumarea = sum(subdf.forarea), sumvol = sum(subdf.forvol), nCountries = size(subdf,1))
       end2×4 DataFrame
 Row │ Year   sumarea  sumvol    nCountries
     │ Int64  Float64  Float64   Int64
─────┼──────────────────────────────────────
   1 │  2000  63.1743   9878.66           4
   2 │  2020  66.2181  11797.1            4

Cumulative computation:

julia> a = combine(groupby(data,["Year"])) do subdf # slower
           (country = subdf.Country, area = subdf.forarea, cumArea = cumsum(subdf.forarea))
       end8×4 DataFrame
 Row │ Year   country  area      cumArea
     │ Int64  String   Float64   Float64
─────┼───────────────────────────────────
   1 │  2000  Germany  11.354    11.354
   2 │  2000  France   15.288    26.642
   3 │  2000  Italy     8.36925  35.0112
   4 │  2000  Sweden   28.163    63.1743
   5 │  2020  Germany  11.419    11.419
   6 │  2020  France   17.253    28.672
   7 │  2020  Italy     9.56613  38.2381
   8 │  2020  Sweden   27.98     66.2181

Note in these examples that while in the aggregation we was returning a single record for each subgroup (hence we did some dimensionality reduction) in the cumulative compuation we still output the whole subgroup, so the combined dataframe in output has the same number of rows as the original dataframe.

An alternative approach is to use the @linq macro from the DatAFrameMEta package that provide a R's dplyr-like query language using piped data:

julia> using DataFramesMeta
julia> dfCum = @linq data |>
                   groupby([:Year]) |>
                   transform(:cumArea = cumsum(:forarea))8×5 DataFrame
 Row │ Country  Year   forarea   forvol   cumArea
     │ String   Int64  Float64   Float64  Float64
─────┼────────────────────────────────────────────
   1 │ Germany   2000  11.354    3381.0   11.354
   2 │ France    2000  15.288    2254.28  26.642
   3 │ Italy     2000   8.36925  1058.71  35.0112
   4 │ Sweden    2000  28.163    3184.67  63.1743
   5 │ Germany   2020  11.419    3663.0   11.419
   6 │ France    2020  17.253    3055.83  28.672
   7 │ Italy     2020   9.56613  1424.4   38.2381
   8 │ Sweden    2020  27.98     3653.91  66.2181

Export and saving

DataFrame to Matrix

julia> M = Matrix(data)8×4 Matrix{Any}:
 "Germany"  2000  11.354    3381.0
 "France"   2000  15.288    2254.28
 "Italy"    2000   8.36925  1058.71
 "Sweden"   2000  28.163    3184.67
 "Germany"  2020  11.419    3663.0
 "France"   2020  17.253    3055.83
 "Italy"    2020   9.56613  1424.4
 "Sweden"   2020  27.98     3653.91

Warning

Attention that if the dataframe contains different types across columns, the inner type of the matrix will be Any

julia> M = Matrix{Union{Float64,Int64,String}}(data)8×4 Matrix{Union{Float64, Int64, String}}:
 "Germany"  2000  11.354    3381.0
 "France"   2000  15.288    2254.28
 "Italy"    2000   8.36925  1058.71
 "Sweden"   2000  28.163    3184.67
 "Germany"  2020  11.419    3663.0
 "France"   2020  17.253    3055.83
 "Italy"    2020   9.56613  1424.4
 "Sweden"   2020  27.98     3653.91

DataFrame to Dictionary

julia> function toDict(df, dimCols, valueCol)
           toReturn = Dict()
           for r in eachrow(df)
               keyValues = []
               [push!(keyValues,r[d]) for d in dimCols]
               toReturn[(keyValues...,)] = r[valueCol]
           end
           return toReturn
       endtoDict (generic function with 1 method)
julia> dict = toDict(data,["Country","Year"],["forarea","forvol"])Dict{Any, Any} with 8 entries:
  ("Germany", 2020) => DataFrameRow…
  ("Italy", 2020)   => DataFrameRow…
  ("Italy", 2000)   => DataFrameRow…
  ("Germany", 2000) => DataFrameRow…
  ("Sweden", 2000)  => DataFrameRow…
  ("France", 2020)  => DataFrameRow…
  ("France", 2000)  => DataFrameRow…
  ("Sweden", 2020)  => DataFrameRow…
julia> dict["Germany",2000][1]11.354
julia> dict["Germany",2000]["forvol"]3381.0
julia> toDict(data,["Country","Year"],"forarea")Dict{Any, Any} with 8 entries:
  ("Germany", 2020) => 11.419
  ("Italy", 2020)   => 9.56613
  ("Italy", 2000)   => 8.36925
  ("Germany", 2000) => 11.354
  ("Sweden", 2000)  => 28.163
  ("France", 2020)  => 17.253
  ("France", 2000)  => 15.288
  ("Sweden", 2020)  => 27.98

DataFrame to NamedTuple

julia> nT = NamedTuple(Dict([Symbol(c) => data[:,c]  for c in names(data)]))        # order not necessarily preserved(Country = ["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"], forvol = [3381.0, 2254.28, 1058.71, 3184.67, 3663.0, 3055.83, 1424.4, 3653.91], Year = [2000, 2000, 2000, 2000, 2020, 2020, 2020, 2020], forarea = [11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98])
julia> using DataStructures
julia> nT = NamedTuple(OrderedDict([Symbol(c) => data[:,c]  for c in names(data)])) # order preserved(Country = ["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"], Year = [2000, 2000, 2000, 2000, 2020, 2020, 2020, 2020], forarea = [11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98], forvol = [3381.0, 2254.28, 1058.71, 3184.67, 3663.0, 3055.83, 1424.4, 3653.91])

Saving as CSV file

julia> rm("outdata.csv", force=true)
julia> CSV.write("outdata.csv",data) # see options at the beginning of segment in the import section and `? CSV.write` for specific export options"outdata.csv"

Saving as OpenDocument spreadsheet

julia> rm("outdata.ods", force=true)
julia> ods_write("outdata.ods",Dict(("myData",3,2) => data)) # exported starting on cell B3 of sheet "myData"

Saving as Excel spreadsheet

julia> rm("outdata.xlsx", force=true)
julia> XLSX.writetable("outdata.xlsx",myData = (collect(eachcol(data)),names(data)))

View this file on Github.

This page was generated using Literate.jl.