Videos related to this segment (click the title to watch)
02 JULIA2 - 1A: Introduction and data import (18:28)
02 JULIA2 - 1B: Getting insights of the data (25:26)
02 JULIA2 - 1C: Edit data and dataframe structure (23:40)
02 JULIA2 - 1D: Pivot, Split-Apply-Combine and data export (23:39)

0201 - Data Wrangling

Some stuff to set-up the environment..

julia> cd(@__DIR__)
julia> using Pkg
julia> Pkg.activate(".") # If using a Julia version different than 1.10 please uncomment and run the following line (reproductibility guarantee will hower be lost) #Pkg.resolve() Activating project at `~/work/SPMLJ/SPMLJ/buildedDoc/02_-_JULIA2_-_Scientific_programming_with_Julia`
julia> Pkg.instantiate()
julia> using Random
julia> Random.seed!(123)Random.TaskLocalRNG()

Introduction

This segment will be mostly based on DataFrames and related packages

!!! info DataFrames vs Matrix DataFrames are popular format for in-memory tabular data. Their main advantages over Arrays are that they can efficiently store different types of data on each column (indeed each column is a wrapper over an Array{T,1} where T is specific to each column) and, thanks also to their named columns, provide convenient API for data operations, like indexing, querying , joining, split-apply-combine, etc.

Info

In most circunstances we can refer to dataframe columns either by using their name as a string, e.g. "Region", or as symbol, e.g. :Region. In the rest of the segment we will use the string approach.

Data import

Our example: Forest volumes and area by country and year Source: Eurostat; units: forarea: Milion hectars, forvol: Milion cubic metres

Built-in solution: CSV –> Matrix

julia> using DelimitedFiles  # in the stdlib
julia> data = convert(Array{Float64,2},readdlm("data.csv",';')[2:end,3:end]) # Skip the first 1 row and the first 2 columns - out is a Matrix8×2 Matrix{Float64}: 11.354 3381.0 15.288 2254.28 8.36925 1058.71 28.163 3184.67 11.419 3663.0 17.253 3055.83 9.56613 1424.4 27.98 3653.91

CSV.jl: {CSV file, hardwritten data} –> DataFrame

julia> using CSV, DataFrames
julia> data = CSV.read("data.csv",DataFrame) # source, sink, kword options8×4 DataFrame Row │ Country Year forarea forvol │ String7 Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> data = CSV.read("data.csv",NamedTuple)(Country = InlineStrings.String7["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"], Year = [2000, 2000, 2000, 2000, 2020, 2020, 2020, 2020], forarea = [11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98], forvol = [3381.0, 2254.28, 1058.71, 3184.67, 3663.0, 3055.83, 1424.4, 3653.91])
julia> data = CSV.read(IOBuffer(""" Country Year forarea forvol Germany 2000 11.354 3381 France 2000 15.288 2254.28 Italy 2000 8.36925 1058.71 Sweden 2000 28.163 3184.67 Germany 2020 11.419 3663 France 2020 17.253 3055.83 Italy 2020 9.56613 1424.4 Sweden 2020 27.98 3653.91 """), DataFrame, copycols=true)8×4 DataFrame Row │ Country Year forarea forvol │ String7 Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91

Some common CSV.jl options: delim (use '\t' for tab delimited files), quotechar, openquotechar, closequotechar, escapechar, missingstring, dateformat, append, writeheader, header, newline, quotestrings, decimal, header, normalizenames, datarow, skipto, footerskip, limit, transpose, comment, use_mmap, type, types (e.g. types=Dict("fieldFoo" => Union{Missing,Int64})), typemap, pool, categorical, strict, silencewarnings, ignorerepeated

XLSX.jl: xlsx -> {Matrix, DataFrame}

julia> using XLSX
julia> sheetNames = XLSX.sheetnames(XLSX.readxlsx("data.xlsx"))1-element Vector{String}: "Sheet1"
julia> data = XLSX.readxlsx("data.xlsx")["Sheet1"]["A1:D9"]9×4 Matrix{Any}: "Country" "Year" "forarea" "forvol" "Germany" 2000 11.354 3381 "France" 2000 15.288 2254.28 "Italy" 2000 8.36925 1058.71 "Sweden" 2000 28.163 3184.67 "Germany" 2020 11.419 3663 "France" 2020 17.253 3055.83 "Italy" 2020 9.56613 1424.4 "Sweden" 2020 27.98 3653.91
julia> data = XLSX.readxlsx("data.xlsx")["Sheet1"][:]9×4 Matrix{Any}: "Country" "Year" "forarea" "forvol" "Germany" 2000 11.354 3381 "France" 2000 15.288 2254.28 "Italy" 2000 8.36925 1058.71 "Sweden" 2000 28.163 3184.67 "Germany" 2020 11.419 3663 "France" 2020 17.253 3055.83 "Italy" 2020 9.56613 1424.4 "Sweden" 2020 27.98 3653.91
julia> data = XLSX.readdata("data.xlsx", "Sheet1", "A1:D9")9×4 Matrix{Any}: "Country" "Year" "forarea" "forvol" "Germany" 2000 11.354 3381 "France" 2000 15.288 2254.28 "Italy" 2000 8.36925 1058.71 "Sweden" 2000 28.163 3184.67 "Germany" 2020 11.419 3663 "France" 2020 17.253 3055.83 "Italy" 2020 9.56613 1424.4 "Sweden" 2020 27.98 3653.91
julia> XLSX.readtable("data.xlsx", "Sheet1") # a specific XLSX data structure usable as DF constructorXLSX.DataTable(Any[Any["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"], Any[2000, 2000, 2000, 2000, 2020, 2020, 2020, 2020], Any[11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98], Any[3381, 2254.28, 1058.71, 3184.67, 3663, 3055.83, 1424.4, 3653.91]], [:Country, :Year, :forarea, :forvol], Dict(:Country => 1, :forvol => 4, :Year => 2, :forarea => 3))
julia> data = DataFrame(XLSX.readtable("data.xlsx", "Sheet1"))8×4 DataFrame Row │ Country Year forarea forvol │ Any Any Any Any ─────┼───────────────────────────────── 1 │ Germany 2000 11.354 3381 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91

OdsIO.jl: ods -> {Matrix, DataFrame}

julia> using OdsIO
julia> ods_read("data.ods";sheetName="Sheet1",retType="DataFrame")8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> ods_read("data.ods";sheetName="Sheet1",retType="Matrix",range=[(2,3),(9,4)]) # [(tlr,tlc),(brr,brc)]8×2 Matrix{Any}: 11.354 3381 15.288 2254.28 8.36925 1058.71 28.163 3184.67 11.419 3663 17.253 3055.83 9.56613 1424.4 27.98 3653.91

HTTP.jl: from internet

julia> import HTTP, Pipe.@pipe
julia> using ZipFile, Tar
julia> urlData = "https://github.com/sylvaticus/IntroSPMLJuliaCourse/raw/main/lessonsSources/02_-_JULIA2_-_Scientific_programming_with_Julia/data.csv""https://github.com/sylvaticus/IntroSPMLJuliaCourse/raw/main/lessonsSources/02_-_JULIA2_-_Scientific_programming_with_Julia/data.csv"
julia> urlDataZ = "https://github.com/sylvaticus/IntroSPMLJuliaCourse/raw/main/lessonsSources/02_-_JULIA2_-_Scientific_programming_with_Julia/data.zip""https://github.com/sylvaticus/IntroSPMLJuliaCourse/raw/main/lessonsSources/02_-_JULIA2_-_Scientific_programming_with_Julia/data.zip"
julia> data = @pipe HTTP.get(urlData).body |> replace!(_, UInt8(';') => UInt8(' ')) |> # if we need to do modifications to the file before importing CSV.File(_, delim=' ') |> DataFrame;

ZipFile.jl : from a single zipped csv file..

...on disk...

julia> data = @pipe ZipFile.Reader("data.zip").files[1] |>
                    CSV.File(read(_), delim=';') |>
                    DataFrame8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String7  Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91

...or on internet:

julia> data = @pipe HTTP.get(urlDataZ).body       |>
                    IOBuffer(_)                      |>
                    ZipFile.Reader(_).files[1]    |>
                    CSV.File(read(_), delim=';')  |>
                    DataFrame8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String7  Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91

DataFrame constructor

julia> # named tuple of (colName,colData)..
       data = DataFrame(
           country = ["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"],
           year    = [2000,2000,2000,2000,2020,2020,2020,2020],
           forarea = [11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98],
           forvol  = [3381, 2254.28, 1058.71, 3184.67, 3663, 3055.83, 1424.4, 3653.91]
       )8×4 DataFrame
 Row │ country  year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91

Matrix -> DataFrame

Headers and data separated:

julia> M = ["Germany"	2000	11.354	3381
            "France"	2000	15.288	2254.28
            "Italy"	2000	8.36925	1058.71
            "Sweden"	2000	28.163	3184.67
            "Germany"	2020	11.419	3663
            "France"	2020	17.253	3055.83
            "Italy"	2020	9.56613	1424.4
            "Sweden"	2020	27.98	3653.91]8×4 Matrix{Any}:
 "Germany"  2000  11.354    3381
 "France"   2000  15.288    2254.28
 "Italy"    2000   8.36925  1058.71
 "Sweden"   2000  28.163    3184.67
 "Germany"  2020  11.419    3663
 "France"   2020  17.253    3055.83
 "Italy"    2020   9.56613  1424.4
 "Sweden"   2020  27.98     3653.91
julia> headers = ["Country", "Year", "forarea", "forvol"] # array of colData arrays, array of headers4-element Vector{String}: "Country" "Year" "forarea" "forvol"
julia> data = DataFrame([[M[:,i]...] for i in 1:size(M,2)], Symbol.(headers))8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91

Headers on the first row of the matrix:

julia> M = ["Country"	"Year"	"forarea" "forvol"
            "Germany"	2000	11.354	  3381
            "France"	2000	15.288	  2254.28
            "Italy"	2000	8.36925	  1058.71
            "Sweden"	2000	28.163	  3184.67
            "Germany"	2020	11.419	  3663
            "France"	2020	17.253	  3055.83
            "Italy"	2020	9.56613	  1424.4
            "Sweden"	2020	27.98	  3653.91]9×4 Matrix{Any}:
 "Country"      "Year"    "forarea"      "forvol"
 "Germany"  2000        11.354       3381
 "France"   2000        15.288       2254.28
 "Italy"    2000         8.36925     1058.71
 "Sweden"   2000        28.163       3184.67
 "Germany"  2020        11.419       3663
 "France"   2020        17.253       3055.83
 "Italy"    2020         9.56613     1424.4
 "Sweden"   2020        27.98        3653.91
julia> data = DataFrame([[M[2:end,i]...] for i in 1:size(M,2)], Symbol.(M[1,:])) # note the autorecognision of col types8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91

Getting insights on the data

In VSCode, we can also use the workpanel for a nice sortable tabular view of a df

julia> show(data,allrows=true,allcols=true)8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
julia> first(data,6)6×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83
julia> last(data, 6)6×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Italy 2000 8.36925 1058.71 2 │ Sweden 2000 28.163 3184.67 3 │ Germany 2020 11.419 3663.0 4 │ France 2020 17.253 3055.83 5 │ Italy 2020 9.56613 1424.4 6 │ Sweden 2020 27.98 3653.91
julia> describe(data)4×7 DataFrame Row │ variable mean min median max nmissing eltype │ Symbol Union… Any Union… Any Int64 DataType ─────┼───────────────────────────────────────────────────────────────── 1 │ Country France Sweden 0 String 2 │ Year 2010.0 2000 2010.0 2020 0 Int64 3 │ forarea 16.174 8.36925 13.3535 28.163 0 Float64 4 │ forvol 2709.47 1058.71 3120.25 3663.0 0 Float64
julia> nR,nC = size(data)(8, 4)
julia> names(data)4-element Vector{String}: "Country" "Year" "forarea" "forvol"
julia> for r in eachrow(data) println(r) # note is it a "DataFrameRow" endDataFrameRow Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 DataFrameRow Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 2 │ France 2000 15.288 2254.28 DataFrameRow Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 3 │ Italy 2000 8.36925 1058.71 DataFrameRow Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 4 │ Sweden 2000 28.163 3184.67 DataFrameRow Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 5 │ Germany 2020 11.419 3663.0 DataFrameRow Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 6 │ France 2020 17.253 3055.83 DataFrameRow Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 7 │ Italy 2020 9.56613 1424.4 DataFrameRow Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 8 │ Sweden 2020 27.98 3653.91
julia> for c in eachcol(data) println(c) # an array println(nonmissingtype(eltype(c))) end["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"] String [2000, 2000, 2000, 2000, 2020, 2020, 2020, 2020] Int64 [11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98] Float64 [3381.0, 2254.28, 1058.71, 3184.67, 3663.0, 3055.83, 1424.4, 3653.91] Float64

Selection and querying data

Column(s) selection

In general we can select: (a) by name (strings or symbols) or by position; (b) a single or a vector of columns, (c) copying or making a view (a reference without copying) of the underlying data

julia> data[:,["Country","Year"]] # copy, using strings8×2 DataFrame
 Row │ Country  Year
     │ String   Int64
─────┼────────────────
   1 │ Germany   2000
   2 │ France    2000
   3 │ Italy     2000
   4 │ Sweden    2000
   5 │ Germany   2020
   6 │ France    2020
   7 │ Italy     2020
   8 │ Sweden    2020
julia> data[!,[:Country,:Year]] # view, using symbols8×2 DataFrame Row │ Country Year │ String Int64 ─────┼──────────────── 1 │ Germany 2000 2 │ France 2000 3 │ Italy 2000 4 │ Sweden 2000 5 │ Germany 2020 6 │ France 2020 7 │ Italy 2020 8 │ Sweden 2020
julia> data.Year # equiv. to `data[!,Year]`8-element Vector{Int64}: 2000 2000 2000 2000 2020 2020 2020 2020
julia> data[:,1] # also `data[!,1]`8-element Vector{String}: "Germany" "France" "Italy" "Sweden" "Germany" "France" "Italy" "Sweden"
julia> data[:,Not(["Year"])]8×3 DataFrame Row │ Country forarea forvol │ String Float64 Float64 ─────┼──────────────────────────── 1 │ Germany 11.354 3381.0 2 │ France 15.288 2254.28 3 │ Italy 8.36925 1058.71 4 │ Sweden 28.163 3184.67 5 │ Germany 11.419 3663.0 6 │ France 17.253 3055.83 7 │ Italy 9.56613 1424.4 8 │ Sweden 27.98 3653.91

Row(s) selection

julia> data[1,:]    # DataFrameRowDataFrameRow
 Row │ Country  Year   forarea  forvol
     │ String   Int64  Float64  Float64
─────┼──────────────────────────────────
   1 │ Germany   2000   11.354   3381.0
julia> data[1:3,:] # DataFrame3×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71

Note rows have no title names as colums do.

Cell(s) selection

julia> data[2,[2,4]]DataFrameRow
 Row │ Year   forvol
     │ Int64  Float64
─────┼────────────────
   2 │  2000  2254.28
julia> data[2,"forarea"]15.288

Note that the returned selection is:

  • an Array{T,1} if it is a single column;
  • a DataFrameRow (similar in behaviour to a DataFrame) if a single row;
  • T if a single cell;
  • an other DataFrame otherwise.

Boolean selection

Both rows and column of a DataFrame (but also of an Matrix) can be selected by passing an array of booleans as column or row mask (and, only for Matrices, also a matrix of booleans)

julia> mask = [false, false, false, false, true, true, true, true]8-element Vector{Bool}:
 0
 0
 0
 0
 1
 1
 1
 1
julia> data[mask,:]4×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2020 11.419 3663.0 2 │ France 2020 17.253 3055.83 3 │ Italy 2020 9.56613 1424.4 4 │ Sweden 2020 27.98 3653.91
julia> mask = fill(false,nR,nC)8×4 Matrix{Bool}: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
julia> mask[2:end,2:3] .= true7×2 view(::Matrix{Bool}, 2:8, 2:3) with eltype Bool: 1 1 1 1 1 1 1 1 1 1 1 1 1 1
julia> mask # data[mask] # error !8×4 Matrix{Bool}: 0 0 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
julia> Matrix(data)[mask]14-element Vector{Any}: 2000 2000 2000 2020 2020 2020 2020 15.288 8.36925 28.163 11.419 17.253 9.56613 27.98

Boolean selection can be used to filter on conditions, e.g.:

julia> data[data.Year .>= 2020,:]4×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2020  11.419    3663.0
   2 │ France    2020  17.253    3055.83
   3 │ Italy     2020   9.56613  1424.4
   4 │ Sweden    2020  27.98     3653.91
julia> data[[i in ["France", "Italy"] for i in data.Country] .&& (data.Year .== 2000),:] # note the parhenthesis2×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ France 2000 15.288 2254.28 2 │ Italy 2000 8.36925 1058.71

Filtering using the @subset macro from the DataFramesMacro package

julia> using DataFramesMeta
julia> @subset(data, :Year .> 2010 )4×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2020 11.419 3663.0 2 │ France 2020 17.253 3055.83 3 │ Italy 2020 9.56613 1424.4 4 │ Sweden 2020 27.98 3653.91
julia> colToFilter = :Country:Country
julia> @subset(data, :Year .> 2010, cols(colToFilter) .== "France" ) # Conditions are "end" by default. If the column name is embedded in a varaible we eed to use `cols(varname)`1×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 1 │ France 2020 17.253 3055.83

Filtering using the Query package

julia> using Query
julia> dfOut = @from i in data begin # `i` is a single row @where i.Country == "France" .&& i.Year >= 2000 # Select a group of columns, eventually changing their name: @select {i.Year, FranceForArea=i.forarea} # or just `i` for the whole row @collect DataFrame end2×2 DataFrame Row │ Year FranceForArea │ Int64 Float64 ─────┼────────────────────── 1 │ 2000 15.288 2 │ 2020 17.253

long but flexible

Managing missing values

Tip

See also the section Missingness implementations for a general discussion on missing values. BetaML has now several algorithms for missing imputation.

julia> df = copy(data)
       # df[3,"forarea"]  = missing # Error, type is Flat64, not Union{Float64,Missing}8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
julia> df.forarea = allowmissing(df.forarea) # also disallowmissing8-element Vector{Union{Missing, Float64}}: 11.354 15.288 8.36925 28.163 11.419 17.253 9.56613 27.98
julia> allowmissing!(df)8×4 DataFrame Row │ Country Year forarea forvol │ String? Int64? Float64? Float64? ─────┼───────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> df[3,"forarea"] = missingmissing
julia> df[6,"forarea"] = missingmissing
julia> df[6,"Country"] = missingmissing
julia> nMissings = length(findall(x -> ismissing(x), df.forarea)) # Count `missing` values in a column.2
julia> dropmissing(df)6×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Sweden 2000 28.163 3184.67 4 │ Germany 2020 11.419 3663.0 5 │ Italy 2020 9.56613 1424.4 6 │ Sweden 2020 27.98 3653.91
julia> dropmissing(df[:,["forarea","forvol"]])6×2 DataFrame Row │ forarea forvol │ Float64 Float64 ─────┼─────────────────── 1 │ 11.354 3381.0 2 │ 15.288 2254.28 3 │ 28.163 3184.67 4 │ 11.419 3663.0 5 │ 9.56613 1424.4 6 │ 27.98 3653.91
julia> collect(skipmissing(df.forarea))6-element Vector{Float64}: 11.354 15.288 28.163 11.419 9.56613 27.98
julia> completecases(df)8-element BitVector: 1 1 0 1 1 0 1 1
julia> completecases(df[!,["forarea","forvol"]])8-element BitVector: 1 1 0 1 1 0 1 1
julia> [df[ismissing.(df[!,col]), col] .= 0 for col in names(df) if nonmissingtype(eltype(df[!,col])) <: Number] # Replace `missing` with `0` values in all numeric columns, like `Float64` and `Int64`;3-element Vector{SubArray{T, 1, P, Tuple{Vector{Int64}}, false} where {T, P}}: 0-element view(::Vector{Union{Missing, Int64}}, Int64[]) with eltype Union{Missing, Int64} Union{Missing, Float64}[0.0, 0.0] 0-element view(::Vector{Union{Missing, Float64}}, Int64[]) with eltype Union{Missing, Float64}
julia> [df[ismissing.(df[!,col]), col] .= "" for col in names(df) if nonmissingtype(eltype(df[!,col])) <: AbstractString] # Replace `missing` with `""` values in all string columns;1-element Vector{SubArray{Union{Missing, String}, 1, Vector{Union{Missing, String}}, Tuple{Vector{Int64}}, false}}: [""]
julia> df8×4 DataFrame Row │ Country Year forarea forvol │ String? Int64? Float64? Float64? ─────┼───────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 0.0 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ 2020 0.0 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91

Editing data

julia> df = copy(data)8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
julia> df[1,"forarea"] = 11.311.3
julia> df[[2,4],"forarea"] .= 102-element view(::Vector{Float64}, [2, 4]) with eltype Float64: 10.0 10.0
julia> df8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.3 3381.0 2 │ France 2000 10.0 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 10.0 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> push!(df,["UK",2020,5.0,800.0]) # add row9×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.3 3381.0 2 │ France 2000 10.0 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 10.0 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91 9 │ UK 2020 5.0 800.0
julia> sort!(df,["Country","Year"], rev=false)9×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ France 2000 10.0 2254.28 2 │ France 2020 17.253 3055.83 3 │ Germany 2000 11.3 3381.0 4 │ Germany 2020 11.419 3663.0 5 │ Italy 2000 8.36925 1058.71 6 │ Italy 2020 9.56613 1424.4 7 │ Sweden 2000 10.0 3184.67 8 │ Sweden 2020 27.98 3653.91 9 │ UK 2020 5.0 800.0
julia> df2 = similar(df) # rubish inside9×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 1 │ #undef 0 0.0 0.0 2 │ #undef 0 0.0 0.0 3 │ #undef 0 0.0 0.0 4 │ #undef 0 0.0 0.0 5 │ #undef 0 0.0 0.0 6 │ #undef 0 0.0 0.0 7 │ #undef 0 0.0 0.0 8 │ #undef 0 0.0 0.0 9 │ #undef 0 0.0 0.0
julia> df = similar(df,0) # empty a dataframe. The second parameter is the number of rows desired0×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┴──────────────────────────────────

Work on dataframe structure

julia> df       = copy(data)8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91
julia> df.foo = [1,2,3,4,5,6,7,8] # existing or new column8-element Vector{Int64}: 1 2 3 4 5 6 7 8
julia> df.volHa = data.forvol ./ data.forarea8-element Vector{Float64}: 297.78051787916155 147.45421245421247 126.49998506437257 113.07992756453503 320.78115421665643 177.1187619544427 148.9003390085646 130.59006433166547
julia> df.goo = Array{Union{Missing,Float64},1}(missing,size(data,1))8-element Vector{Union{Missing, Float64}}: missing missing missing missing missing missing missing missing
julia> select!(df,Not(["foo","volHa","goo"])) # remove cols by name8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> rename!(df, ["Country2", "Year2", "forarea2", "forvol2"])8×4 DataFrame Row │ Country2 Year2 forarea2 forvol2 │ String Int64 Float64 Float64 ─────┼──────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> rename!(df, Dict("Country2" => "Country"))8×4 DataFrame Row │ Country Year2 forarea2 forvol2 │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> df = df[:,["Year2", "Country","forarea2","forvol2"] ] # change column order8×4 DataFrame Row │ Year2 Country forarea2 forvol2 │ Int64 String Float64 Float64 ─────┼─────────────────────────────────── 1 │ 2000 Germany 11.354 3381.0 2 │ 2000 France 15.288 2254.28 3 │ 2000 Italy 8.36925 1058.71 4 │ 2000 Sweden 28.163 3184.67 5 │ 2020 Germany 11.419 3663.0 6 │ 2020 France 17.253 3055.83 7 │ 2020 Italy 9.56613 1424.4 8 │ 2020 Sweden 27.98 3653.91
julia> insertcols!(df, 2, :foo => [1,2,3,4,5,6,7,8] ) # insert a column at position 28×5 DataFrame Row │ Year2 foo Country forarea2 forvol2 │ Int64 Int64 String Float64 Float64 ─────┼────────────────────────────────────────── 1 │ 2000 1 Germany 11.354 3381.0 2 │ 2000 2 France 15.288 2254.28 3 │ 2000 3 Italy 8.36925 1058.71 4 │ 2000 4 Sweden 28.163 3184.67 5 │ 2020 5 Germany 11.419 3663.0 6 │ 2020 6 France 17.253 3055.83 7 │ 2020 7 Italy 9.56613 1424.4 8 │ 2020 8 Sweden 27.98 3653.91
julia> df.namedYear = map(string,df.Year2)8-element Vector{String}: "2000" "2000" "2000" "2000" "2020" "2020" "2020" "2020"
julia> stringToInt(str) = try parse(Int64, str) catch; return(missing) end; df.Year3 = map(stringToInt, df.namedYear)8-element Vector{Int64}: 2000 2000 2000 2000 2020 2020 2020 2020
julia> df2 = hcat(df,data,makeunique=true)8×11 DataFrame Row │ Year2 foo Country forarea2 forvol2 namedYear Year3 Country_1 ⋯ │ Int64 Int64 String Float64 Float64 String Int64 String ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ 2000 1 Germany 11.354 3381.0 2000 2000 Germany ⋯ 2 │ 2000 2 France 15.288 2254.28 2000 2000 France 3 │ 2000 3 Italy 8.36925 1058.71 2000 2000 Italy 4 │ 2000 4 Sweden 28.163 3184.67 2000 2000 Sweden 5 │ 2020 5 Germany 11.419 3663.0 2020 2020 Germany ⋯ 6 │ 2020 6 France 17.253 3055.83 2020 2020 France 7 │ 2020 7 Italy 9.56613 1424.4 2020 2020 Italy 8 │ 2020 8 Sweden 27.98 3653.91 2020 2020 Sweden 3 columns omitted
julia> df3 = copy(data)8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> df4 = vcat(data,df3)16×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91 9 │ Germany 2000 11.354 3381.0 10 │ France 2000 15.288 2254.28 11 │ Italy 2000 8.36925 1058.71 12 │ Sweden 2000 28.163 3184.67 13 │ Germany 2020 11.419 3663.0 14 │ France 2020 17.253 3055.83 15 │ Italy 2020 9.56613 1424.4 16 │ Sweden 2020 27.98 3653.91

Categorical data

julia> using CategoricalArrays # You may want to consider also PooledArrays
julia> df.Year2 = categorical(df.Year2)8-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}: 2000 2000 2000 2000 2020 2020 2020 2020
julia> transform!(df, names(df, AbstractString) .=> categorical, renamecols=false) # transform to categorical all string columns8×7 DataFrame Row │ Year2 foo Country forarea2 forvol2 namedYear Year3 │ Cat… Int64 Cat… Float64 Float64 Cat… Int64 ─────┼──────────────────────────────────────────────────────────── 1 │ 2000 1 Germany 11.354 3381.0 2000 2000 2 │ 2000 2 France 15.288 2254.28 2000 2000 3 │ 2000 3 Italy 8.36925 1058.71 2000 2000 4 │ 2000 4 Sweden 28.163 3184.67 2000 2000 5 │ 2020 5 Germany 11.419 3663.0 2020 2020 6 │ 2020 6 France 17.253 3055.83 2020 2020 7 │ 2020 7 Italy 9.56613 1424.4 2020 2020 8 │ 2020 8 Sweden 27.98 3653.91 2020 2020
Warning

Attention that while the memory to store the data decreases, and grouping is way more efficient, filtering with categorical values is not necessarily quicker (indeed it can be a bit slower)

julia> levels(df.Year2)2-element Vector{Int64}:
 2000
 2020
julia> levels!(df.Country,["Sweden","Germany","France","Italy"]) # Let you define a personalised order, useful for ordered data8-element CategoricalArrays.CategoricalArray{String,1,UInt32}: "Germany" "France" "Italy" "Sweden" "Germany" "France" "Italy" "Sweden"
julia> sort(df.Country)8-element CategoricalArrays.CategoricalArray{String,1,UInt32}: "Sweden" "Sweden" "Germany" "Germany" "France" "France" "Italy" "Italy"
julia> sort!(df,"Country")8×7 DataFrame Row │ Year2 foo Country forarea2 forvol2 namedYear Year3 │ Cat… Int64 Cat… Float64 Float64 Cat… Int64 ─────┼──────────────────────────────────────────────────────────── 1 │ 2000 4 Sweden 28.163 3184.67 2000 2000 2 │ 2020 8 Sweden 27.98 3653.91 2020 2020 3 │ 2000 1 Germany 11.354 3381.0 2000 2000 4 │ 2020 5 Germany 11.419 3663.0 2020 2020 5 │ 2000 2 France 15.288 2254.28 2000 2000 6 │ 2020 6 France 17.253 3055.83 2020 2020 7 │ 2000 3 Italy 8.36925 1058.71 2000 2000 8 │ 2020 7 Italy 9.56613 1424.4 2020 2020
julia> df.Years2 = unwrap.(df.Year2) # convert a categorical array into a normal one.8-element Vector{Int64}: 2000 2020 2000 2020 2000 2020 2000 2020

Joining dataframes

julia> df1,df2 = copy(data),copy(data)(8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91, 8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64   Float64
─────┼───────────────────────────────────
   1 │ Germany   2000  11.354    3381.0
   2 │ France    2000  15.288    2254.28
   3 │ Italy     2000   8.36925  1058.71
   4 │ Sweden    2000  28.163    3184.67
   5 │ Germany   2020  11.419    3663.0
   6 │ France    2020  17.253    3055.83
   7 │ Italy     2020   9.56613  1424.4
   8 │ Sweden    2020  27.98     3653.91)
julia> push!(df1,["US",2020,5.0,1000.0])9×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91 9 │ US 2020 5.0 1000.0
julia> push!(df2,["China",2020,50.0,1000.0])9×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91 9 │ China 2020 50.0 1000.0
julia> rename!(df2,"Year"=>"year")9×4 DataFrame Row │ Country year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91 9 │ China 2020 50.0 1000.0
julia> innerjoin(df1,df2,on=["Country","Year"=>"year"],makeunique=true) # common records only8×6 DataFrame Row │ Country Year forarea forvol forarea_1 forvol_1 │ String Int64 Float64 Float64 Float64 Float64 ─────┼──────────────────────────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 11.354 3381.0 2 │ France 2000 15.288 2254.28 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 11.419 3663.0 6 │ France 2020 17.253 3055.83 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91 27.98 3653.91

Also available: leftjoin (all records on left df), rightjoin (all on right df), outerjoin(all records returned), semijoin (like inner join by only with columns from the right df), antijoin (left not on right df) and crossjoin (like the cartesian product, each on the right by each on the left)

Pivoting data

long and wide are two kind of layout for equivalent representation of tabled data. In a long layout we have each row being an observation of a single variable, and each record is represented as dim1, dim2, ..., value. As the name implies, "long" layouts tend to be relativelly long and hard to analyse by an human, but are very easity to handle. At the opposite, A wide layout represents multiple observations on the same row, eventually using multiple horizontal axis as in the next figure (but Julia dataframes handle only a single horizzontal axis):

Forest AreaForest Volumes
2000202020002020
Germany11.3511.4233813663
France15.2917.2522543056

_wide layout is easier to visually analise for a human mind but much more hard to analyse in a sistermatic way. We will learn now how to move from one type of layout to the other.

Stacking columns: from wide to long

julia> longDf  = stack(data,["forarea","forvol"])    # we specify the variable to stack (the "measured" variables)16×4 DataFrame
 Row │ Country  Year   variable  value
     │ String   Int64  String    Float64
─────┼──────────────────────────────────────
   1 │ Germany   2000  forarea     11.354
   2 │ France    2000  forarea     15.288
   3 │ Italy     2000  forarea      8.36925
   4 │ Sweden    2000  forarea     28.163
   5 │ Germany   2020  forarea     11.419
   6 │ France    2020  forarea     17.253
   7 │ Italy     2020  forarea      9.56613
   8 │ Sweden    2020  forarea     27.98
   9 │ Germany   2000  forvol    3381.0
  10 │ France    2000  forvol    2254.28
  11 │ Italy     2000  forvol    1058.71
  12 │ Sweden    2000  forvol    3184.67
  13 │ Germany   2020  forvol    3663.0
  14 │ France    2020  forvol    3055.83
  15 │ Italy     2020  forvol    1424.4
  16 │ Sweden    2020  forvol    3653.91
julia> longDf2 = stack(data,Not(["Country","Year"])) # we specify the variables _not_ to stack (the id variables)16×4 DataFrame Row │ Country Year variable value │ String Int64 String Float64 ─────┼────────────────────────────────────── 1 │ Germany 2000 forarea 11.354 2 │ France 2000 forarea 15.288 3 │ Italy 2000 forarea 8.36925 4 │ Sweden 2000 forarea 28.163 5 │ Germany 2020 forarea 11.419 6 │ France 2020 forarea 17.253 7 │ Italy 2020 forarea 9.56613 8 │ Sweden 2020 forarea 27.98 9 │ Germany 2000 forvol 3381.0 10 │ France 2000 forvol 2254.28 11 │ Italy 2000 forvol 1058.71 12 │ Sweden 2000 forvol 3184.67 13 │ Germany 2020 forvol 3663.0 14 │ France 2020 forvol 3055.83 15 │ Italy 2020 forvol 1424.4 16 │ Sweden 2020 forvol 3653.91
julia> longDf3 = stack(data) # automatically stack all numerical variables16×4 DataFrame Row │ Country Year variable value │ String Int64 String Float64 ─────┼────────────────────────────────────── 1 │ Germany 2000 forarea 11.354 2 │ France 2000 forarea 15.288 3 │ Italy 2000 forarea 8.36925 4 │ Sweden 2000 forarea 28.163 5 │ Germany 2020 forarea 11.419 6 │ France 2020 forarea 17.253 7 │ Italy 2020 forarea 9.56613 8 │ Sweden 2020 forarea 27.98 9 │ Germany 2000 forvol 3381.0 10 │ France 2000 forvol 2254.28 11 │ Italy 2000 forvol 1058.71 12 │ Sweden 2000 forvol 3184.67 13 │ Germany 2020 forvol 3663.0 14 │ France 2020 forvol 3055.83 15 │ Italy 2020 forvol 1424.4 16 │ Sweden 2020 forvol 3653.91
julia> longDf == longDf2 == longDf3true

Note how the columns variable and value have been added automatically to host the stachked data

Unstacking columns: from wide to long

julia> wideDf = unstack(longDf,["Country","Year"],"variable","value") # args: df, [cols to remains cols also in the wide layout], column with the ids to expand horizontally and column with the relative values8×4 DataFrame
 Row │ Country  Year   forarea   forvol
     │ String   Int64  Float64?  Float64?
─────┼────────────────────────────────────
   1 │ Germany   2000  11.354     3381.0
   2 │ France    2000  15.288     2254.28
   3 │ Italy     2000   8.36925   1058.71
   4 │ Sweden    2000  28.163     3184.67
   5 │ Germany   2020  11.419     3663.0
   6 │ France    2020  17.253     3055.83
   7 │ Italy     2020   9.56613   1424.4
   8 │ Sweden    2020  27.98      3653.91
julia> wideDf2 = unstack(longDf,"variable","value") # cols to remains cols also in the wide layout omitted: all cols not to expand and relative value col remains as col8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64? Float64? ─────┼──────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> wideDf == wideDf2 == datatrue

While the DataFrames package doesn't support multiple axis we can still arrive to the table below with a little bit of work by unstacking different columns in separate wide dataframes and then joining or horizontally concatenating them:

julia> wideArea = unstack(data,"Country","Year","forarea")4×3 DataFrame
 Row │ Country  2000      2020
     │ String   Float64?  Float64?
─────┼─────────────────────────────
   1 │ Germany  11.354    11.419
   2 │ France   15.288    17.253
   3 │ Italy     8.36925   9.56613
   4 │ Sweden   28.163    27.98
julia> wideVols = unstack(data,"Country","Year","forvol")4×3 DataFrame Row │ Country 2000 2020 │ String Float64? Float64? ─────┼───────────────────────────── 1 │ Germany 3381.0 3663.0 2 │ France 2254.28 3055.83 3 │ Italy 1058.71 1424.4 4 │ Sweden 3184.67 3653.91
julia> rename!(wideArea,["Country","area_2000","area_2020"])4×3 DataFrame Row │ Country area_2000 area_2020 │ String Float64? Float64? ─────┼─────────────────────────────── 1 │ Germany 11.354 11.419 2 │ France 15.288 17.253 3 │ Italy 8.36925 9.56613 4 │ Sweden 28.163 27.98
julia> rename!(wideVols,["Country","vol_2000","vol_2020"])4×3 DataFrame Row │ Country vol_2000 vol_2020 │ String Float64? Float64? ─────┼───────────────────────────── 1 │ Germany 3381.0 3663.0 2 │ France 2254.28 3055.83 3 │ Italy 1058.71 1424.4 4 │ Sweden 3184.67 3653.91
julia> wideWideDf = outerjoin(wideArea,wideVols,on="Country")4×5 DataFrame Row │ Country area_2000 area_2020 vol_2000 vol_2020 │ String Float64? Float64? Float64? Float64? ─────┼─────────────────────────────────────────────────── 1 │ Germany 11.354 11.419 3381.0 3663.0 2 │ France 15.288 17.253 2254.28 3055.83 3 │ Italy 8.36925 9.56613 1058.71 1424.4 4 │ Sweden 28.163 27.98 3184.67 3653.91

The Split-Apply-Combine strategy

Aka "divide and conquer". Rather than try to modify the dataset direclty, we first split it in subparts, we work on each subpart and then we recombine them in a target dataset

julia> using Statistics # for `mean`
julia> groupby(data,["Country","Year"]) # The "split" partGroupedDataFrame with 8 groups based on keys: Country, Year First Group (1 row): Country = "Germany", Year = 2000 Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 ⋮ Last Group (1 row): Country = "Sweden", Year = 2020 Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 1 │ Sweden 2020 27.98 3653.91

Aggregation:

julia> combine(groupby(data,["Year"]) ,  "forarea" => sum => "sum_area", "forvol" => sum => "sum_vol", nrow)2×4 DataFrame
 Row │ Year   sum_area  sum_vol   nrow
     │ Int64  Float64   Float64   Int64
─────┼──────────────────────────────────
   1 │  2000   63.1743   9878.66      4
   2 │  2020   66.2181  11797.1       4

...or...

julia> combine(groupby(data,["Year"])) do subdf # slower
           (sumarea = sum(subdf.forarea), sumvol = sum(subdf.forvol), nCountries = size(subdf,1))
       end2×4 DataFrame
 Row │ Year   sumarea  sumvol    nCountries
     │ Int64  Float64  Float64   Int64
─────┼──────────────────────────────────────
   1 │  2000  63.1743   9878.66           4
   2 │  2020  66.2181  11797.1            4

Cumulative computation:

julia> a = combine(groupby(data,["Year"])) do subdf # slower
           (country = subdf.Country, area = subdf.forarea, cumArea = cumsum(subdf.forarea))
       end8×4 DataFrame
 Row │ Year   country  area      cumArea
     │ Int64  String   Float64   Float64
─────┼───────────────────────────────────
   1 │  2000  Germany  11.354    11.354
   2 │  2000  France   15.288    26.642
   3 │  2000  Italy     8.36925  35.0112
   4 │  2000  Sweden   28.163    63.1743
   5 │  2020  Germany  11.419    11.419
   6 │  2020  France   17.253    28.672
   7 │  2020  Italy     9.56613  38.2381
   8 │  2020  Sweden   27.98     66.2181

Note in these examples that while in the aggregation we was returning a single record for each subgroup (hence we did some dimensionality reduction) in the cumulative compuation we still output the whole subgroup, so the combined dataframe in output has the same number of rows as the original dataframe.

An alternative approach is to use the @linq macro from the DatAFrameMEta package that provide a R's dplyr-like query language using piped data:

julia> using DataFramesMeta
julia> dfCum = @linq data |> groupby([:Year]) |> transform(:cumArea = cumsum(:forarea))8×5 DataFrame Row │ Country Year forarea forvol cumArea │ String Int64 Float64 Float64 Float64 ─────┼──────────────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 11.354 2 │ France 2000 15.288 2254.28 26.642 3 │ Italy 2000 8.36925 1058.71 35.0112 4 │ Sweden 2000 28.163 3184.67 63.1743 5 │ Germany 2020 11.419 3663.0 11.419 6 │ France 2020 17.253 3055.83 28.672 7 │ Italy 2020 9.56613 1424.4 38.2381 8 │ Sweden 2020 27.98 3653.91 66.2181

Export and saving

DataFrame to Matrix

julia> M = Matrix(data)8×4 Matrix{Any}:
 "Germany"  2000  11.354    3381.0
 "France"   2000  15.288    2254.28
 "Italy"    2000   8.36925  1058.71
 "Sweden"   2000  28.163    3184.67
 "Germany"  2020  11.419    3663.0
 "France"   2020  17.253    3055.83
 "Italy"    2020   9.56613  1424.4
 "Sweden"   2020  27.98     3653.91
Warning

Attention that if the dataframe contains different types across columns, the inner type of the matrix will be Any

julia> M = Matrix{Union{Float64,Int64,String}}(data)8×4 Matrix{Union{Float64, Int64, String}}:
 "Germany"  2000  11.354    3381.0
 "France"   2000  15.288    2254.28
 "Italy"    2000   8.36925  1058.71
 "Sweden"   2000  28.163    3184.67
 "Germany"  2020  11.419    3663.0
 "France"   2020  17.253    3055.83
 "Italy"    2020   9.56613  1424.4
 "Sweden"   2020  27.98     3653.91

DataFrame to Dictionary

julia> function toDict(df, dimCols, valueCol)
           toReturn = Dict()
           for r in eachrow(df)
               keyValues = []
               [push!(keyValues,r[d]) for d in dimCols]
               toReturn[(keyValues...,)] = r[valueCol]
           end
           return toReturn
       endtoDict (generic function with 1 method)
julia> dict = toDict(data,["Country","Year"],["forarea","forvol"])Dict{Any, Any} with 8 entries: ("Germany", 2020) => DataFrameRow… ("Italy", 2020) => DataFrameRow… ("Italy", 2000) => DataFrameRow… ("Germany", 2000) => DataFrameRow… ("Sweden", 2000) => DataFrameRow… ("France", 2020) => DataFrameRow… ("France", 2000) => DataFrameRow… ("Sweden", 2020) => DataFrameRow
julia> dict["Germany",2000][1]11.354
julia> dict["Germany",2000]["forvol"]3381.0
julia> toDict(data,["Country","Year"],"forarea")Dict{Any, Any} with 8 entries: ("Germany", 2020) => 11.419 ("Italy", 2020) => 9.56613 ("Italy", 2000) => 8.36925 ("Germany", 2000) => 11.354 ("Sweden", 2000) => 28.163 ("France", 2020) => 17.253 ("France", 2000) => 15.288 ("Sweden", 2020) => 27.98

DataFrame to NamedTuple

julia> nT = NamedTuple(Dict([Symbol(c) => data[:,c]  for c in names(data)]))        # order not necessarily preserved(Country = ["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"], forvol = [3381.0, 2254.28, 1058.71, 3184.67, 3663.0, 3055.83, 1424.4, 3653.91], Year = [2000, 2000, 2000, 2000, 2020, 2020, 2020, 2020], forarea = [11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98])
julia> using DataStructures
julia> nT = NamedTuple(OrderedDict([Symbol(c) => data[:,c] for c in names(data)])) # order preserved(Country = ["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"], Year = [2000, 2000, 2000, 2000, 2020, 2020, 2020, 2020], forarea = [11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98], forvol = [3381.0, 2254.28, 1058.71, 3184.67, 3663.0, 3055.83, 1424.4, 3653.91])

Saving as CSV file

julia> rm("outdata.csv", force=true)
julia> CSV.write("outdata.csv",data) # see options at the beginning of segment in the import section and `? CSV.write` for specific export options"outdata.csv"

Saving as OpenDocument spreadsheet

julia> rm("outdata.ods", force=true)
julia> ods_write("outdata.ods",Dict(("myData",3,2) => data)) # exported starting on cell B3 of sheet "myData"

Saving as Excel spreadsheet

julia> rm("outdata.xlsx", force=true)
julia> XLSX.writetable("outdata.xlsx",myData = (collect(eachcol(data)),names(data)))

View this file on Github.


This page was generated using Literate.jl.