02 JULIA2 - 1A: Introduction and data import (18:28)
02 JULIA2 - 1B: Getting insights of the data (25:26)
02 JULIA2 - 1C: Edit data and dataframe structure (23:40)
02 JULIA2 - 1D: Pivot, Split-Apply-Combine and data export (23:39)
0201 - Data Wrangling
Some stuff to set-up the environment..
julia> cd(@__DIR__)
julia> using Pkg
julia> Pkg.activate(".") # If using a Julia version different than 1.10 please uncomment and run the following line (reproductibility guarantee will hower be lost) #Pkg.resolve()
Activating project at `~/work/SPMLJ/SPMLJ/buildedDoc/02_-_JULIA2_-_Scientific_programming_with_Julia`
julia> Pkg.instantiate()
julia> using Random
julia> Random.seed!(123)
Random.TaskLocalRNG()
Introduction
This segment will be mostly based on DataFrames and related packages
!!! info DataFrames vs Matrix DataFrames are popular format for in-memory tabular data. Their main advantages over Arrays are that they can efficiently store different types of data on each column (indeed each column is a wrapper over an Array{T,1}
where T
is specific to each column) and, thanks also to their named columns, provide convenient API for data operations, like indexing, querying , joining, split-apply-combine, etc.
In most circunstances we can refer to dataframe columns either by using their name as a string, e.g. "Region"
, or as symbol, e.g. :Region
. In the rest of the segment we will use the string approach.
Data import
Our example: Forest volumes and area by country and year Source: Eurostat; units: forarea: Milion hectars, forvol: Milion cubic metres
Built-in solution: CSV –> Matrix
julia> using DelimitedFiles # in the stdlib
julia> data = convert(Array{Float64,2},readdlm("data.csv",';')[2:end,3:end]) # Skip the first 1 row and the first 2 columns - out is a Matrix
8×2 Matrix{Float64}: 11.354 3381.0 15.288 2254.28 8.36925 1058.71 28.163 3184.67 11.419 3663.0 17.253 3055.83 9.56613 1424.4 27.98 3653.91
CSV.jl: {CSV file, hardwritten data} –> DataFrame
julia> using CSV, DataFrames
julia> data = CSV.read("data.csv",DataFrame) # source, sink, kword options
8×4 DataFrame Row │ Country Year forarea forvol │ String7 Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> data = CSV.read("data.csv",NamedTuple)
(Country = InlineStrings.String7["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"], Year = [2000, 2000, 2000, 2000, 2020, 2020, 2020, 2020], forarea = [11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98], forvol = [3381.0, 2254.28, 1058.71, 3184.67, 3663.0, 3055.83, 1424.4, 3653.91])
julia> data = CSV.read(IOBuffer(""" Country Year forarea forvol Germany 2000 11.354 3381 France 2000 15.288 2254.28 Italy 2000 8.36925 1058.71 Sweden 2000 28.163 3184.67 Germany 2020 11.419 3663 France 2020 17.253 3055.83 Italy 2020 9.56613 1424.4 Sweden 2020 27.98 3653.91 """), DataFrame, copycols=true)
8×4 DataFrame Row │ Country Year forarea forvol │ String7 Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
Some common CSV.jl options: delim
(use '\t'
for tab delimited files), quotechar
, openquotechar
, closequotechar
, escapechar
, missingstring
, dateformat
, append
, writeheader
, header
, newline
, quotestrings
, decimal
, header
, normalizenames
, datarow
, skipto
, footerskip
, limit
, transpose
, comment
, use_mmap
, type
, types
(e.g. types=Dict("fieldFoo" => Union{Missing,Int64})
), typemap
, pool
, categorical
, strict
, silencewarnings
, ignorerepeated
XLSX.jl: xlsx -> {Matrix, DataFrame}
julia> using XLSX
julia> sheetNames = XLSX.sheetnames(XLSX.readxlsx("data.xlsx"))
1-element Vector{String}: "Sheet1"
julia> data = XLSX.readxlsx("data.xlsx")["Sheet1"]["A1:D9"]
9×4 Matrix{Any}: "Country" "Year" "forarea" "forvol" "Germany" 2000 11.354 3381 "France" 2000 15.288 2254.28 "Italy" 2000 8.36925 1058.71 "Sweden" 2000 28.163 3184.67 "Germany" 2020 11.419 3663 "France" 2020 17.253 3055.83 "Italy" 2020 9.56613 1424.4 "Sweden" 2020 27.98 3653.91
julia> data = XLSX.readxlsx("data.xlsx")["Sheet1"][:]
9×4 Matrix{Any}: "Country" "Year" "forarea" "forvol" "Germany" 2000 11.354 3381 "France" 2000 15.288 2254.28 "Italy" 2000 8.36925 1058.71 "Sweden" 2000 28.163 3184.67 "Germany" 2020 11.419 3663 "France" 2020 17.253 3055.83 "Italy" 2020 9.56613 1424.4 "Sweden" 2020 27.98 3653.91
julia> data = XLSX.readdata("data.xlsx", "Sheet1", "A1:D9")
9×4 Matrix{Any}: "Country" "Year" "forarea" "forvol" "Germany" 2000 11.354 3381 "France" 2000 15.288 2254.28 "Italy" 2000 8.36925 1058.71 "Sweden" 2000 28.163 3184.67 "Germany" 2020 11.419 3663 "France" 2020 17.253 3055.83 "Italy" 2020 9.56613 1424.4 "Sweden" 2020 27.98 3653.91
julia> XLSX.readtable("data.xlsx", "Sheet1") # a specific XLSX data structure usable as DF constructor
XLSX.DataTable(Any[Any["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"], Any[2000, 2000, 2000, 2000, 2020, 2020, 2020, 2020], Any[11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98], Any[3381, 2254.28, 1058.71, 3184.67, 3663, 3055.83, 1424.4, 3653.91]], [:Country, :Year, :forarea, :forvol], Dict(:Country => 1, :forvol => 4, :Year => 2, :forarea => 3))
julia> data = DataFrame(XLSX.readtable("data.xlsx", "Sheet1"))
8×4 DataFrame Row │ Country Year forarea forvol │ Any Any Any Any ─────┼───────────────────────────────── 1 │ Germany 2000 11.354 3381 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
OdsIO.jl: ods -> {Matrix, DataFrame}
julia> using OdsIO
julia> ods_read("data.ods";sheetName="Sheet1",retType="DataFrame")
8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> ods_read("data.ods";sheetName="Sheet1",retType="Matrix",range=[(2,3),(9,4)]) # [(tlr,tlc),(brr,brc)]
8×2 Matrix{Any}: 11.354 3381 15.288 2254.28 8.36925 1058.71 28.163 3184.67 11.419 3663 17.253 3055.83 9.56613 1424.4 27.98 3653.91
HTTP.jl: from internet
julia> import HTTP, Pipe.@pipe
julia> using ZipFile, Tar
julia> urlData = "https://github.com/sylvaticus/IntroSPMLJuliaCourse/raw/main/lessonsSources/02_-_JULIA2_-_Scientific_programming_with_Julia/data.csv"
"https://github.com/sylvaticus/IntroSPMLJuliaCourse/raw/main/lessonsSources/02_-_JULIA2_-_Scientific_programming_with_Julia/data.csv"
julia> urlDataZ = "https://github.com/sylvaticus/IntroSPMLJuliaCourse/raw/main/lessonsSources/02_-_JULIA2_-_Scientific_programming_with_Julia/data.zip"
"https://github.com/sylvaticus/IntroSPMLJuliaCourse/raw/main/lessonsSources/02_-_JULIA2_-_Scientific_programming_with_Julia/data.zip"
julia> data = @pipe HTTP.get(urlData).body |> replace!(_, UInt8(';') => UInt8(' ')) |> # if we need to do modifications to the file before importing CSV.File(_, delim=' ') |> DataFrame;
ZipFile.jl : from a single zipped csv file..
...on disk...
julia> data = @pipe ZipFile.Reader("data.zip").files[1] |> CSV.File(read(_), delim=';') |> DataFrame
8×4 DataFrame Row │ Country Year forarea forvol │ String7 Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
...or on internet:
julia> data = @pipe HTTP.get(urlDataZ).body |> IOBuffer(_) |> ZipFile.Reader(_).files[1] |> CSV.File(read(_), delim=';') |> DataFrame
8×4 DataFrame Row │ Country Year forarea forvol │ String7 Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
DataFrame constructor
julia> # named tuple of (colName,colData).. data = DataFrame( country = ["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"], year = [2000,2000,2000,2000,2020,2020,2020,2020], forarea = [11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98], forvol = [3381, 2254.28, 1058.71, 3184.67, 3663, 3055.83, 1424.4, 3653.91] )
8×4 DataFrame Row │ country year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
Matrix -> DataFrame
Headers and data separated:
julia> M = ["Germany" 2000 11.354 3381 "France" 2000 15.288 2254.28 "Italy" 2000 8.36925 1058.71 "Sweden" 2000 28.163 3184.67 "Germany" 2020 11.419 3663 "France" 2020 17.253 3055.83 "Italy" 2020 9.56613 1424.4 "Sweden" 2020 27.98 3653.91]
8×4 Matrix{Any}: "Germany" 2000 11.354 3381 "France" 2000 15.288 2254.28 "Italy" 2000 8.36925 1058.71 "Sweden" 2000 28.163 3184.67 "Germany" 2020 11.419 3663 "France" 2020 17.253 3055.83 "Italy" 2020 9.56613 1424.4 "Sweden" 2020 27.98 3653.91
julia> headers = ["Country", "Year", "forarea", "forvol"] # array of colData arrays, array of headers
4-element Vector{String}: "Country" "Year" "forarea" "forvol"
julia> data = DataFrame([[M[:,i]...] for i in 1:size(M,2)], Symbol.(headers))
8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
Headers on the first row of the matrix:
julia> M = ["Country" "Year" "forarea" "forvol" "Germany" 2000 11.354 3381 "France" 2000 15.288 2254.28 "Italy" 2000 8.36925 1058.71 "Sweden" 2000 28.163 3184.67 "Germany" 2020 11.419 3663 "France" 2020 17.253 3055.83 "Italy" 2020 9.56613 1424.4 "Sweden" 2020 27.98 3653.91]
9×4 Matrix{Any}: "Country" "Year" "forarea" "forvol" "Germany" 2000 11.354 3381 "France" 2000 15.288 2254.28 "Italy" 2000 8.36925 1058.71 "Sweden" 2000 28.163 3184.67 "Germany" 2020 11.419 3663 "France" 2020 17.253 3055.83 "Italy" 2020 9.56613 1424.4 "Sweden" 2020 27.98 3653.91
julia> data = DataFrame([[M[2:end,i]...] for i in 1:size(M,2)], Symbol.(M[1,:])) # note the autorecognision of col types
8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
Getting insights on the data
In VSCode, we can also use the workpanel for a nice sortable tabular view of a df
julia> show(data,allrows=true,allcols=true)
8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> first(data,6)
6×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83
julia> last(data, 6)
6×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Italy 2000 8.36925 1058.71 2 │ Sweden 2000 28.163 3184.67 3 │ Germany 2020 11.419 3663.0 4 │ France 2020 17.253 3055.83 5 │ Italy 2020 9.56613 1424.4 6 │ Sweden 2020 27.98 3653.91
julia> describe(data)
4×7 DataFrame Row │ variable mean min median max nmissing eltype │ Symbol Union… Any Union… Any Int64 DataType ─────┼───────────────────────────────────────────────────────────────── 1 │ Country France Sweden 0 String 2 │ Year 2010.0 2000 2010.0 2020 0 Int64 3 │ forarea 16.174 8.36925 13.3535 28.163 0 Float64 4 │ forvol 2709.47 1058.71 3120.25 3663.0 0 Float64
julia> nR,nC = size(data)
(8, 4)
julia> names(data)
4-element Vector{String}: "Country" "Year" "forarea" "forvol"
julia> for r in eachrow(data) println(r) # note is it a "DataFrameRow" end
DataFrameRow Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 DataFrameRow Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 2 │ France 2000 15.288 2254.28 DataFrameRow Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 3 │ Italy 2000 8.36925 1058.71 DataFrameRow Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 4 │ Sweden 2000 28.163 3184.67 DataFrameRow Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 5 │ Germany 2020 11.419 3663.0 DataFrameRow Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 6 │ France 2020 17.253 3055.83 DataFrameRow Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 7 │ Italy 2020 9.56613 1424.4 DataFrameRow Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 8 │ Sweden 2020 27.98 3653.91
julia> for c in eachcol(data) println(c) # an array println(nonmissingtype(eltype(c))) end
["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"] String [2000, 2000, 2000, 2000, 2020, 2020, 2020, 2020] Int64 [11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98] Float64 [3381.0, 2254.28, 1058.71, 3184.67, 3663.0, 3055.83, 1424.4, 3653.91] Float64
Selection and querying data
Column(s) selection
In general we can select: (a) by name (strings or symbols) or by position; (b) a single or a vector of columns, (c) copying or making a view (a reference without copying) of the underlying data
julia> data[:,["Country","Year"]] # copy, using strings
8×2 DataFrame Row │ Country Year │ String Int64 ─────┼──────────────── 1 │ Germany 2000 2 │ France 2000 3 │ Italy 2000 4 │ Sweden 2000 5 │ Germany 2020 6 │ France 2020 7 │ Italy 2020 8 │ Sweden 2020
julia> data[!,[:Country,:Year]] # view, using symbols
8×2 DataFrame Row │ Country Year │ String Int64 ─────┼──────────────── 1 │ Germany 2000 2 │ France 2000 3 │ Italy 2000 4 │ Sweden 2000 5 │ Germany 2020 6 │ France 2020 7 │ Italy 2020 8 │ Sweden 2020
julia> data.Year # equiv. to `data[!,Year]`
8-element Vector{Int64}: 2000 2000 2000 2000 2020 2020 2020 2020
julia> data[:,1] # also `data[!,1]`
8-element Vector{String}: "Germany" "France" "Italy" "Sweden" "Germany" "France" "Italy" "Sweden"
julia> data[:,Not(["Year"])]
8×3 DataFrame Row │ Country forarea forvol │ String Float64 Float64 ─────┼──────────────────────────── 1 │ Germany 11.354 3381.0 2 │ France 15.288 2254.28 3 │ Italy 8.36925 1058.71 4 │ Sweden 28.163 3184.67 5 │ Germany 11.419 3663.0 6 │ France 17.253 3055.83 7 │ Italy 9.56613 1424.4 8 │ Sweden 27.98 3653.91
Row(s) selection
julia> data[1,:] # DataFrameRow
DataFrameRow Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 1 │ Germany 2000 11.354 3381.0
julia> data[1:3,:] # DataFrame
3×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71
Note rows have no title names as colums do.
Cell(s) selection
julia> data[2,[2,4]]
DataFrameRow Row │ Year forvol │ Int64 Float64 ─────┼──────────────── 2 │ 2000 2254.28
julia> data[2,"forarea"]
15.288
Note that the returned selection is:
- an
Array{T,1}
if it is a single column; - a
DataFrameRow
(similar in behaviour to aDataFrame
) if a single row; T
if a single cell;- an other
DataFrame
otherwise.
Boolean selection
Both rows and column of a DataFrame (but also of an Matrix) can be selected by passing an array of booleans as column or row mask (and, only for Matrices, also a matrix of booleans)
julia> mask = [false, false, false, false, true, true, true, true]
8-element Vector{Bool}: 0 0 0 0 1 1 1 1
julia> data[mask,:]
4×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2020 11.419 3663.0 2 │ France 2020 17.253 3055.83 3 │ Italy 2020 9.56613 1424.4 4 │ Sweden 2020 27.98 3653.91
julia> mask = fill(false,nR,nC)
8×4 Matrix{Bool}: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
julia> mask[2:end,2:3] .= true
7×2 view(::Matrix{Bool}, 2:8, 2:3) with eltype Bool: 1 1 1 1 1 1 1 1 1 1 1 1 1 1
julia> mask # data[mask] # error !
8×4 Matrix{Bool}: 0 0 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
julia> Matrix(data)[mask]
14-element Vector{Any}: 2000 2000 2000 2020 2020 2020 2020 15.288 8.36925 28.163 11.419 17.253 9.56613 27.98
Boolean selection can be used to filter on conditions, e.g.:
julia> data[data.Year .>= 2020,:]
4×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2020 11.419 3663.0 2 │ France 2020 17.253 3055.83 3 │ Italy 2020 9.56613 1424.4 4 │ Sweden 2020 27.98 3653.91
julia> data[[i in ["France", "Italy"] for i in data.Country] .&& (data.Year .== 2000),:] # note the parhenthesis
2×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ France 2000 15.288 2254.28 2 │ Italy 2000 8.36925 1058.71
Filtering using the @subset macro from the DataFramesMacro
package
julia> using DataFramesMeta
julia> @subset(data, :Year .> 2010 )
4×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2020 11.419 3663.0 2 │ France 2020 17.253 3055.83 3 │ Italy 2020 9.56613 1424.4 4 │ Sweden 2020 27.98 3653.91
julia> colToFilter = :Country
:Country
julia> @subset(data, :Year .> 2010, cols(colToFilter) .== "France" ) # Conditions are "end" by default. If the column name is embedded in a varaible we eed to use `cols(varname)`
1×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 1 │ France 2020 17.253 3055.83
Filtering using the Query
package
julia> using Query
julia> dfOut = @from i in data begin # `i` is a single row @where i.Country == "France" .&& i.Year >= 2000 # Select a group of columns, eventually changing their name: @select {i.Year, FranceForArea=i.forarea} # or just `i` for the whole row @collect DataFrame end
2×2 DataFrame Row │ Year FranceForArea │ Int64 Float64 ─────┼────────────────────── 1 │ 2000 15.288 2 │ 2020 17.253
long but flexible
Managing missing values
See also the section Missingness implementations
for a general discussion on missing values. BetaML has now several algorithms for missing imputation.
julia> df = copy(data) # df[3,"forarea"] = missing # Error, type is Flat64, not Union{Float64,Missing}
8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> df.forarea = allowmissing(df.forarea) # also disallowmissing
8-element Vector{Union{Missing, Float64}}: 11.354 15.288 8.36925 28.163 11.419 17.253 9.56613 27.98
julia> allowmissing!(df)
8×4 DataFrame Row │ Country Year forarea forvol │ String? Int64? Float64? Float64? ─────┼───────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> df[3,"forarea"] = missing
missing
julia> df[6,"forarea"] = missing
missing
julia> df[6,"Country"] = missing
missing
julia> nMissings = length(findall(x -> ismissing(x), df.forarea)) # Count `missing` values in a column.
2
julia> dropmissing(df)
6×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Sweden 2000 28.163 3184.67 4 │ Germany 2020 11.419 3663.0 5 │ Italy 2020 9.56613 1424.4 6 │ Sweden 2020 27.98 3653.91
julia> dropmissing(df[:,["forarea","forvol"]])
6×2 DataFrame Row │ forarea forvol │ Float64 Float64 ─────┼─────────────────── 1 │ 11.354 3381.0 2 │ 15.288 2254.28 3 │ 28.163 3184.67 4 │ 11.419 3663.0 5 │ 9.56613 1424.4 6 │ 27.98 3653.91
julia> collect(skipmissing(df.forarea))
6-element Vector{Float64}: 11.354 15.288 28.163 11.419 9.56613 27.98
julia> completecases(df)
8-element BitVector: 1 1 0 1 1 0 1 1
julia> completecases(df[!,["forarea","forvol"]])
8-element BitVector: 1 1 0 1 1 0 1 1
julia> [df[ismissing.(df[!,col]), col] .= 0 for col in names(df) if nonmissingtype(eltype(df[!,col])) <: Number] # Replace `missing` with `0` values in all numeric columns, like `Float64` and `Int64`;
3-element Vector{SubArray{T, 1, P, Tuple{Vector{Int64}}, false} where {T, P}}: 0-element view(::Vector{Union{Missing, Int64}}, Int64[]) with eltype Union{Missing, Int64} Union{Missing, Float64}[0.0, 0.0] 0-element view(::Vector{Union{Missing, Float64}}, Int64[]) with eltype Union{Missing, Float64}
julia> [df[ismissing.(df[!,col]), col] .= "" for col in names(df) if nonmissingtype(eltype(df[!,col])) <: AbstractString] # Replace `missing` with `""` values in all string columns;
1-element Vector{SubArray{Union{Missing, String}, 1, Vector{Union{Missing, String}}, Tuple{Vector{Int64}}, false}}: [""]
julia> df
8×4 DataFrame Row │ Country Year forarea forvol │ String? Int64? Float64? Float64? ─────┼───────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 0.0 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ 2020 0.0 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
Editing data
julia> df = copy(data)
8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> df[1,"forarea"] = 11.3
11.3
julia> df[[2,4],"forarea"] .= 10
2-element view(::Vector{Float64}, [2, 4]) with eltype Float64: 10.0 10.0
julia> df
8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.3 3381.0 2 │ France 2000 10.0 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 10.0 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> push!(df,["UK",2020,5.0,800.0]) # add row
9×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.3 3381.0 2 │ France 2000 10.0 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 10.0 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91 9 │ UK 2020 5.0 800.0
julia> sort!(df,["Country","Year"], rev=false)
9×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ France 2000 10.0 2254.28 2 │ France 2020 17.253 3055.83 3 │ Germany 2000 11.3 3381.0 4 │ Germany 2020 11.419 3663.0 5 │ Italy 2000 8.36925 1058.71 6 │ Italy 2020 9.56613 1424.4 7 │ Sweden 2000 10.0 3184.67 8 │ Sweden 2020 27.98 3653.91 9 │ UK 2020 5.0 800.0
julia> df2 = similar(df) # rubish inside
9×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 1 │ #undef 0 0.0 0.0 2 │ #undef 0 0.0 0.0 3 │ #undef 0 0.0 0.0 4 │ #undef 0 0.0 0.0 5 │ #undef 0 0.0 0.0 6 │ #undef 0 0.0 0.0 7 │ #undef 0 0.0 0.0 8 │ #undef 0 0.0 0.0 9 │ #undef 0 0.0 0.0
julia> df = similar(df,0) # empty a dataframe. The second parameter is the number of rows desired
0×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┴──────────────────────────────────
Work on dataframe structure
julia> df = copy(data)
8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> df.foo = [1,2,3,4,5,6,7,8] # existing or new column
8-element Vector{Int64}: 1 2 3 4 5 6 7 8
julia> df.volHa = data.forvol ./ data.forarea
8-element Vector{Float64}: 297.78051787916155 147.45421245421247 126.49998506437257 113.07992756453503 320.78115421665643 177.1187619544427 148.9003390085646 130.59006433166547
julia> df.goo = Array{Union{Missing,Float64},1}(missing,size(data,1))
8-element Vector{Union{Missing, Float64}}: missing missing missing missing missing missing missing missing
julia> select!(df,Not(["foo","volHa","goo"])) # remove cols by name
8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> rename!(df, ["Country2", "Year2", "forarea2", "forvol2"])
8×4 DataFrame Row │ Country2 Year2 forarea2 forvol2 │ String Int64 Float64 Float64 ─────┼──────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> rename!(df, Dict("Country2" => "Country"))
8×4 DataFrame Row │ Country Year2 forarea2 forvol2 │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> df = df[:,["Year2", "Country","forarea2","forvol2"] ] # change column order
8×4 DataFrame Row │ Year2 Country forarea2 forvol2 │ Int64 String Float64 Float64 ─────┼─────────────────────────────────── 1 │ 2000 Germany 11.354 3381.0 2 │ 2000 France 15.288 2254.28 3 │ 2000 Italy 8.36925 1058.71 4 │ 2000 Sweden 28.163 3184.67 5 │ 2020 Germany 11.419 3663.0 6 │ 2020 France 17.253 3055.83 7 │ 2020 Italy 9.56613 1424.4 8 │ 2020 Sweden 27.98 3653.91
julia> insertcols!(df, 2, :foo => [1,2,3,4,5,6,7,8] ) # insert a column at position 2
8×5 DataFrame Row │ Year2 foo Country forarea2 forvol2 │ Int64 Int64 String Float64 Float64 ─────┼────────────────────────────────────────── 1 │ 2000 1 Germany 11.354 3381.0 2 │ 2000 2 France 15.288 2254.28 3 │ 2000 3 Italy 8.36925 1058.71 4 │ 2000 4 Sweden 28.163 3184.67 5 │ 2020 5 Germany 11.419 3663.0 6 │ 2020 6 France 17.253 3055.83 7 │ 2020 7 Italy 9.56613 1424.4 8 │ 2020 8 Sweden 27.98 3653.91
julia> df.namedYear = map(string,df.Year2)
8-element Vector{String}: "2000" "2000" "2000" "2000" "2020" "2020" "2020" "2020"
julia> stringToInt(str) = try parse(Int64, str) catch; return(missing) end; df.Year3 = map(stringToInt, df.namedYear)
8-element Vector{Int64}: 2000 2000 2000 2000 2020 2020 2020 2020
julia> df2 = hcat(df,data,makeunique=true)
8×11 DataFrame Row │ Year2 foo Country forarea2 forvol2 namedYear Year3 Country_1 ⋯ │ Int64 Int64 String Float64 Float64 String Int64 String ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ 2000 1 Germany 11.354 3381.0 2000 2000 Germany ⋯ 2 │ 2000 2 France 15.288 2254.28 2000 2000 France 3 │ 2000 3 Italy 8.36925 1058.71 2000 2000 Italy 4 │ 2000 4 Sweden 28.163 3184.67 2000 2000 Sweden 5 │ 2020 5 Germany 11.419 3663.0 2020 2020 Germany ⋯ 6 │ 2020 6 France 17.253 3055.83 2020 2020 France 7 │ 2020 7 Italy 9.56613 1424.4 2020 2020 Italy 8 │ 2020 8 Sweden 27.98 3653.91 2020 2020 Sweden 3 columns omitted
julia> df3 = copy(data)
8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> df4 = vcat(data,df3)
16×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91 9 │ Germany 2000 11.354 3381.0 10 │ France 2000 15.288 2254.28 11 │ Italy 2000 8.36925 1058.71 12 │ Sweden 2000 28.163 3184.67 13 │ Germany 2020 11.419 3663.0 14 │ France 2020 17.253 3055.83 15 │ Italy 2020 9.56613 1424.4 16 │ Sweden 2020 27.98 3653.91
Categorical data
julia> using CategoricalArrays # You may want to consider also PooledArrays
julia> df.Year2 = categorical(df.Year2)
8-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}: 2000 2000 2000 2000 2020 2020 2020 2020
julia> transform!(df, names(df, AbstractString) .=> categorical, renamecols=false) # transform to categorical all string columns
8×7 DataFrame Row │ Year2 foo Country forarea2 forvol2 namedYear Year3 │ Cat… Int64 Cat… Float64 Float64 Cat… Int64 ─────┼──────────────────────────────────────────────────────────── 1 │ 2000 1 Germany 11.354 3381.0 2000 2000 2 │ 2000 2 France 15.288 2254.28 2000 2000 3 │ 2000 3 Italy 8.36925 1058.71 2000 2000 4 │ 2000 4 Sweden 28.163 3184.67 2000 2000 5 │ 2020 5 Germany 11.419 3663.0 2020 2020 6 │ 2020 6 France 17.253 3055.83 2020 2020 7 │ 2020 7 Italy 9.56613 1424.4 2020 2020 8 │ 2020 8 Sweden 27.98 3653.91 2020 2020
Attention that while the memory to store the data decreases, and grouping is way more efficient, filtering with categorical values is not necessarily quicker (indeed it can be a bit slower)
julia> levels(df.Year2)
2-element Vector{Int64}: 2000 2020
julia> levels!(df.Country,["Sweden","Germany","France","Italy"]) # Let you define a personalised order, useful for ordered data
8-element CategoricalArrays.CategoricalArray{String,1,UInt32}: "Germany" "France" "Italy" "Sweden" "Germany" "France" "Italy" "Sweden"
julia> sort(df.Country)
8-element CategoricalArrays.CategoricalArray{String,1,UInt32}: "Sweden" "Sweden" "Germany" "Germany" "France" "France" "Italy" "Italy"
julia> sort!(df,"Country")
8×7 DataFrame Row │ Year2 foo Country forarea2 forvol2 namedYear Year3 │ Cat… Int64 Cat… Float64 Float64 Cat… Int64 ─────┼──────────────────────────────────────────────────────────── 1 │ 2000 4 Sweden 28.163 3184.67 2000 2000 2 │ 2020 8 Sweden 27.98 3653.91 2020 2020 3 │ 2000 1 Germany 11.354 3381.0 2000 2000 4 │ 2020 5 Germany 11.419 3663.0 2020 2020 5 │ 2000 2 France 15.288 2254.28 2000 2000 6 │ 2020 6 France 17.253 3055.83 2020 2020 7 │ 2000 3 Italy 8.36925 1058.71 2000 2000 8 │ 2020 7 Italy 9.56613 1424.4 2020 2020
julia> df.Years2 = unwrap.(df.Year2) # convert a categorical array into a normal one.
8-element Vector{Int64}: 2000 2020 2000 2020 2000 2020 2000 2020
Joining dataframes
julia> df1,df2 = copy(data),copy(data)
(8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91, 8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91)
julia> push!(df1,["US",2020,5.0,1000.0])
9×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91 9 │ US 2020 5.0 1000.0
julia> push!(df2,["China",2020,50.0,1000.0])
9×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91 9 │ China 2020 50.0 1000.0
julia> rename!(df2,"Year"=>"year")
9×4 DataFrame Row │ Country year forarea forvol │ String Int64 Float64 Float64 ─────┼─────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91 9 │ China 2020 50.0 1000.0
julia> innerjoin(df1,df2,on=["Country","Year"=>"year"],makeunique=true) # common records only
8×6 DataFrame Row │ Country Year forarea forvol forarea_1 forvol_1 │ String Int64 Float64 Float64 Float64 Float64 ─────┼──────────────────────────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 11.354 3381.0 2 │ France 2000 15.288 2254.28 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 11.419 3663.0 6 │ France 2020 17.253 3055.83 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91 27.98 3653.91
Also available: leftjoin
(all records on left df), rightjoin
(all on right df), outerjoin
(all records returned), semijoin
(like inner join by only with columns from the right df), antijoin
(left not on right df) and crossjoin
(like the cartesian product, each on the right by each on the left)
Pivoting data
long and wide are two kind of layout for equivalent representation of tabled data. In a long layout we have each row being an observation of a single variable, and each record is represented as dim1, dim2, ..., value. As the name implies, "long" layouts tend to be relativelly long and hard to analyse by an human, but are very easity to handle. At the opposite, A wide layout represents multiple observations on the same row, eventually using multiple horizontal axis as in the next figure (but Julia dataframes handle only a single horizzontal axis):
Forest Area | Forest Volumes | |||
2000 | 2020 | 2000 | 2020 | |
Germany | 11.35 | 11.42 | 3381 | 3663 |
France | 15.29 | 17.25 | 2254 | 3056 |
_wide layout is easier to visually analise for a human mind but much more hard to analyse in a sistermatic way. We will learn now how to move from one type of layout to the other.
Stacking columns: from wide to long
julia> longDf = stack(data,["forarea","forvol"]) # we specify the variable to stack (the "measured" variables)
16×4 DataFrame Row │ Country Year variable value │ String Int64 String Float64 ─────┼────────────────────────────────────── 1 │ Germany 2000 forarea 11.354 2 │ France 2000 forarea 15.288 3 │ Italy 2000 forarea 8.36925 4 │ Sweden 2000 forarea 28.163 5 │ Germany 2020 forarea 11.419 6 │ France 2020 forarea 17.253 7 │ Italy 2020 forarea 9.56613 8 │ Sweden 2020 forarea 27.98 9 │ Germany 2000 forvol 3381.0 10 │ France 2000 forvol 2254.28 11 │ Italy 2000 forvol 1058.71 12 │ Sweden 2000 forvol 3184.67 13 │ Germany 2020 forvol 3663.0 14 │ France 2020 forvol 3055.83 15 │ Italy 2020 forvol 1424.4 16 │ Sweden 2020 forvol 3653.91
julia> longDf2 = stack(data,Not(["Country","Year"])) # we specify the variables _not_ to stack (the id variables)
16×4 DataFrame Row │ Country Year variable value │ String Int64 String Float64 ─────┼────────────────────────────────────── 1 │ Germany 2000 forarea 11.354 2 │ France 2000 forarea 15.288 3 │ Italy 2000 forarea 8.36925 4 │ Sweden 2000 forarea 28.163 5 │ Germany 2020 forarea 11.419 6 │ France 2020 forarea 17.253 7 │ Italy 2020 forarea 9.56613 8 │ Sweden 2020 forarea 27.98 9 │ Germany 2000 forvol 3381.0 10 │ France 2000 forvol 2254.28 11 │ Italy 2000 forvol 1058.71 12 │ Sweden 2000 forvol 3184.67 13 │ Germany 2020 forvol 3663.0 14 │ France 2020 forvol 3055.83 15 │ Italy 2020 forvol 1424.4 16 │ Sweden 2020 forvol 3653.91
julia> longDf3 = stack(data) # automatically stack all numerical variables
16×4 DataFrame Row │ Country Year variable value │ String Int64 String Float64 ─────┼────────────────────────────────────── 1 │ Germany 2000 forarea 11.354 2 │ France 2000 forarea 15.288 3 │ Italy 2000 forarea 8.36925 4 │ Sweden 2000 forarea 28.163 5 │ Germany 2020 forarea 11.419 6 │ France 2020 forarea 17.253 7 │ Italy 2020 forarea 9.56613 8 │ Sweden 2020 forarea 27.98 9 │ Germany 2000 forvol 3381.0 10 │ France 2000 forvol 2254.28 11 │ Italy 2000 forvol 1058.71 12 │ Sweden 2000 forvol 3184.67 13 │ Germany 2020 forvol 3663.0 14 │ France 2020 forvol 3055.83 15 │ Italy 2020 forvol 1424.4 16 │ Sweden 2020 forvol 3653.91
julia> longDf == longDf2 == longDf3
true
Note how the columns variable
and value
have been added automatically to host the stachked data
Unstacking columns: from wide to long
julia> wideDf = unstack(longDf,["Country","Year"],"variable","value") # args: df, [cols to remains cols also in the wide layout], column with the ids to expand horizontally and column with the relative values
8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64? Float64? ─────┼──────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> wideDf2 = unstack(longDf,"variable","value") # cols to remains cols also in the wide layout omitted: all cols not to expand and relative value col remains as col
8×4 DataFrame Row │ Country Year forarea forvol │ String Int64 Float64? Float64? ─────┼──────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 2 │ France 2000 15.288 2254.28 3 │ Italy 2000 8.36925 1058.71 4 │ Sweden 2000 28.163 3184.67 5 │ Germany 2020 11.419 3663.0 6 │ France 2020 17.253 3055.83 7 │ Italy 2020 9.56613 1424.4 8 │ Sweden 2020 27.98 3653.91
julia> wideDf == wideDf2 == data
true
While the DataFrames package doesn't support multiple axis we can still arrive to the table below with a little bit of work by unstacking different columns in separate wide dataframes and then joining or horizontally concatenating them:
julia> wideArea = unstack(data,"Country","Year","forarea")
4×3 DataFrame Row │ Country 2000 2020 │ String Float64? Float64? ─────┼───────────────────────────── 1 │ Germany 11.354 11.419 2 │ France 15.288 17.253 3 │ Italy 8.36925 9.56613 4 │ Sweden 28.163 27.98
julia> wideVols = unstack(data,"Country","Year","forvol")
4×3 DataFrame Row │ Country 2000 2020 │ String Float64? Float64? ─────┼───────────────────────────── 1 │ Germany 3381.0 3663.0 2 │ France 2254.28 3055.83 3 │ Italy 1058.71 1424.4 4 │ Sweden 3184.67 3653.91
julia> rename!(wideArea,["Country","area_2000","area_2020"])
4×3 DataFrame Row │ Country area_2000 area_2020 │ String Float64? Float64? ─────┼─────────────────────────────── 1 │ Germany 11.354 11.419 2 │ France 15.288 17.253 3 │ Italy 8.36925 9.56613 4 │ Sweden 28.163 27.98
julia> rename!(wideVols,["Country","vol_2000","vol_2020"])
4×3 DataFrame Row │ Country vol_2000 vol_2020 │ String Float64? Float64? ─────┼───────────────────────────── 1 │ Germany 3381.0 3663.0 2 │ France 2254.28 3055.83 3 │ Italy 1058.71 1424.4 4 │ Sweden 3184.67 3653.91
julia> wideWideDf = outerjoin(wideArea,wideVols,on="Country")
4×5 DataFrame Row │ Country area_2000 area_2020 vol_2000 vol_2020 │ String Float64? Float64? Float64? Float64? ─────┼─────────────────────────────────────────────────── 1 │ Germany 11.354 11.419 3381.0 3663.0 2 │ France 15.288 17.253 2254.28 3055.83 3 │ Italy 8.36925 9.56613 1058.71 1424.4 4 │ Sweden 28.163 27.98 3184.67 3653.91
The Split-Apply-Combine strategy
Aka "divide and conquer". Rather than try to modify the dataset direclty, we first split it in subparts, we work on each subpart and then we recombine them in a target dataset
julia> using Statistics # for `mean`
julia> groupby(data,["Country","Year"]) # The "split" part
GroupedDataFrame with 8 groups based on keys: Country, Year First Group (1 row): Country = "Germany", Year = 2000 Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 ⋮ Last Group (1 row): Country = "Sweden", Year = 2020 Row │ Country Year forarea forvol │ String Int64 Float64 Float64 ─────┼────────────────────────────────── 1 │ Sweden 2020 27.98 3653.91
Aggregation:
julia> combine(groupby(data,["Year"]) , "forarea" => sum => "sum_area", "forvol" => sum => "sum_vol", nrow)
2×4 DataFrame Row │ Year sum_area sum_vol nrow │ Int64 Float64 Float64 Int64 ─────┼────────────────────────────────── 1 │ 2000 63.1743 9878.66 4 2 │ 2020 66.2181 11797.1 4
...or...
julia> combine(groupby(data,["Year"])) do subdf # slower (sumarea = sum(subdf.forarea), sumvol = sum(subdf.forvol), nCountries = size(subdf,1)) end
2×4 DataFrame Row │ Year sumarea sumvol nCountries │ Int64 Float64 Float64 Int64 ─────┼────────────────────────────────────── 1 │ 2000 63.1743 9878.66 4 2 │ 2020 66.2181 11797.1 4
Cumulative computation:
julia> a = combine(groupby(data,["Year"])) do subdf # slower (country = subdf.Country, area = subdf.forarea, cumArea = cumsum(subdf.forarea)) end
8×4 DataFrame Row │ Year country area cumArea │ Int64 String Float64 Float64 ─────┼─────────────────────────────────── 1 │ 2000 Germany 11.354 11.354 2 │ 2000 France 15.288 26.642 3 │ 2000 Italy 8.36925 35.0112 4 │ 2000 Sweden 28.163 63.1743 5 │ 2020 Germany 11.419 11.419 6 │ 2020 France 17.253 28.672 7 │ 2020 Italy 9.56613 38.2381 8 │ 2020 Sweden 27.98 66.2181
Note in these examples that while in the aggregation we was returning a single record for each subgroup (hence we did some dimensionality reduction) in the cumulative compuation we still output the whole subgroup, so the combined dataframe in output has the same number of rows as the original dataframe.
An alternative approach is to use the @linq
macro from the DatAFrameMEta
package that provide a R's dplyr
-like query language using piped data:
julia> using DataFramesMeta
julia> dfCum = @linq data |> groupby([:Year]) |> transform(:cumArea = cumsum(:forarea))
8×5 DataFrame Row │ Country Year forarea forvol cumArea │ String Int64 Float64 Float64 Float64 ─────┼──────────────────────────────────────────── 1 │ Germany 2000 11.354 3381.0 11.354 2 │ France 2000 15.288 2254.28 26.642 3 │ Italy 2000 8.36925 1058.71 35.0112 4 │ Sweden 2000 28.163 3184.67 63.1743 5 │ Germany 2020 11.419 3663.0 11.419 6 │ France 2020 17.253 3055.83 28.672 7 │ Italy 2020 9.56613 1424.4 38.2381 8 │ Sweden 2020 27.98 3653.91 66.2181
Export and saving
DataFrame to Matrix
julia> M = Matrix(data)
8×4 Matrix{Any}: "Germany" 2000 11.354 3381.0 "France" 2000 15.288 2254.28 "Italy" 2000 8.36925 1058.71 "Sweden" 2000 28.163 3184.67 "Germany" 2020 11.419 3663.0 "France" 2020 17.253 3055.83 "Italy" 2020 9.56613 1424.4 "Sweden" 2020 27.98 3653.91
Attention that if the dataframe contains different types across columns, the inner type of the matrix will be Any
julia> M = Matrix{Union{Float64,Int64,String}}(data)
8×4 Matrix{Union{Float64, Int64, String}}: "Germany" 2000 11.354 3381.0 "France" 2000 15.288 2254.28 "Italy" 2000 8.36925 1058.71 "Sweden" 2000 28.163 3184.67 "Germany" 2020 11.419 3663.0 "France" 2020 17.253 3055.83 "Italy" 2020 9.56613 1424.4 "Sweden" 2020 27.98 3653.91
DataFrame to Dictionary
julia> function toDict(df, dimCols, valueCol) toReturn = Dict() for r in eachrow(df) keyValues = [] [push!(keyValues,r[d]) for d in dimCols] toReturn[(keyValues...,)] = r[valueCol] end return toReturn end
toDict (generic function with 1 method)
julia> dict = toDict(data,["Country","Year"],["forarea","forvol"])
Dict{Any, Any} with 8 entries: ("Germany", 2020) => DataFrameRow… ("Italy", 2020) => DataFrameRow… ("Italy", 2000) => DataFrameRow… ("Germany", 2000) => DataFrameRow… ("Sweden", 2000) => DataFrameRow… ("France", 2020) => DataFrameRow… ("France", 2000) => DataFrameRow… ("Sweden", 2020) => DataFrameRow…
julia> dict["Germany",2000][1]
11.354
julia> dict["Germany",2000]["forvol"]
3381.0
julia> toDict(data,["Country","Year"],"forarea")
Dict{Any, Any} with 8 entries: ("Germany", 2020) => 11.419 ("Italy", 2020) => 9.56613 ("Italy", 2000) => 8.36925 ("Germany", 2000) => 11.354 ("Sweden", 2000) => 28.163 ("France", 2020) => 17.253 ("France", 2000) => 15.288 ("Sweden", 2020) => 27.98
DataFrame to NamedTuple
julia> nT = NamedTuple(Dict([Symbol(c) => data[:,c] for c in names(data)])) # order not necessarily preserved
(Country = ["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"], forvol = [3381.0, 2254.28, 1058.71, 3184.67, 3663.0, 3055.83, 1424.4, 3653.91], Year = [2000, 2000, 2000, 2000, 2020, 2020, 2020, 2020], forarea = [11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98])
julia> using DataStructures
julia> nT = NamedTuple(OrderedDict([Symbol(c) => data[:,c] for c in names(data)])) # order preserved
(Country = ["Germany", "France", "Italy", "Sweden", "Germany", "France", "Italy", "Sweden"], Year = [2000, 2000, 2000, 2000, 2020, 2020, 2020, 2020], forarea = [11.354, 15.288, 8.36925, 28.163, 11.419, 17.253, 9.56613, 27.98], forvol = [3381.0, 2254.28, 1058.71, 3184.67, 3663.0, 3055.83, 1424.4, 3653.91])
Saving as CSV file
julia> rm("outdata.csv", force=true)
julia> CSV.write("outdata.csv",data) # see options at the beginning of segment in the import section and `? CSV.write` for specific export options
"outdata.csv"
Saving as OpenDocument spreadsheet
julia> rm("outdata.ods", force=true)
julia> ods_write("outdata.ods",Dict(("myData",3,2) => data)) # exported starting on cell B3 of sheet "myData"
Saving as Excel spreadsheet
julia> rm("outdata.xlsx", force=true)
julia> XLSX.writetable("outdata.xlsx",myData = (collect(eachcol(data)),names(data)))
This page was generated using Literate.jl.