Missing Data
Missing data is common in many, though not all, fields of research. In addition, users may want to handle different encodings for missing data differently, e.g. encoding data that has been masked or removed for privacy reasons with a different value than data that simply doesn't exist. To enable these distinctions, uCSV requires that users provide arguments that instruct uCSV.read how they would like missing data to be parsed. The two easiest ways to achieve this are with the typedetectrows and allowmissing arguments. If typedetectrows > 1 and both missing values and non-missing values of some type T are encountered in the column, uCSV.read will return that column as Union{T, Missing}. If the first missing value does not appear until many hundreds of lines into the dataset, it is advised that you declare which columns may contain missing values with the allowmissing argument for improved parsing efficiency. Users may also use the types argument to specify a column as being Union{T, Missing}.
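To make the detection rule above concrete, here is a minimal sketch in plain Julia. It is not uCSV's implementation: sketch_eltype is a hypothetical helper, and it assumes that only the first n fields of a column are inspected and that Int and String are the only candidate types. It illustrates why a missing value that first appears beyond the inspected rows cannot influence the detected type.

```julia
# Hypothetical helper, NOT part of uCSV: guess a column's element type
# from its first `n` raw fields, given a Dict of strings that encode `missing`.
function sketch_eltype(fields, encodings, n::Integer)
    window = first(fields, n)                          # only the first n rows are inspected
    decoded = [get(encodings, f, f) for f in window]   # encoded fields become `missing`
    present = collect(skipmissing(decoded))            # the fields that carry real values
    # Assumption: a window holding nothing but missing values is typed as Missing.
    isempty(present) && return Missing
    # Assumption: Int if every present field parses as an integer, otherwise String.
    T = all(f -> tryparse(Int, f) !== nothing, present) ? Int : String
    return any(ismissing, decoded) ? Union{T, Missing} : T
end

encodings = Dict("" => missing, "NA" => missing)
col = ["hey", "you", "", "NA"]

sketch_eltype(col, encodings, 2)   # String (no missing value seen yet)
sketch_eltype(col, encodings, 3)   # Union{Missing, String}
```

Under these assumptions, a column whose first missing value sits far down the file would need a correspondingly large detection window, which is the case where declaring the column up front with allowmissing or types is the cheaper option.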
Detecting columns that contain missing values via typedetectrows
julia> using uCSV, DataFrames
julia> s =
"""
1,hey,1
2,you,2
3,,3
4,"",4
5,NULL,5
6,NA,6
""";
julia> encodings = Dict("" => missing, "\"\"" => missing, "NULL" => missing, "NA" => missing);
julia> DataFrame(uCSV.read(IOBuffer(s), encodings=encodings, typedetectrows=3))
6×3 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │
│ │ Int64 │ String⍰ │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1 │ 1 │ hey │ 1 │
│ 2 │ 2 │ you │ 2 │
│ 3 │ 3 │ missing │ 3 │
│ 4 │ 4 │ missing │ 4 │
│ 5 │ 5 │ missing │ 5 │
│ 6 │ 6 │ missing │ 6 │
Declaring that all columns may contain missing values
julia> using uCSV, DataFrames
julia> s =
"""
1,hey,1
2,you,2
3,,3
4,"",4
5,NULL,5
6,NA,6
""";
julia> encodings = Dict("" => missing, "\"\"" => missing, "NULL" => missing, "NA" => missing);
julia> DataFrame(uCSV.read(IOBuffer(s), encodings=encodings, allowmissing=true))
6×3 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │
│ │ Int64⍰ │ String⍰ │ Int64⍰ │
├─────┼────────┼─────────┼────────┤
│ 1 │ 1 │ hey │ 1 │
│ 2 │ 2 │ you │ 2 │
│ 3 │ 3 │ missing │ 3 │
│ 4 │ 4 │ missing │ 4 │
│ 5 │ 5 │ missing │ 5 │
│ 6 │ 6 │ missing │ 6 │
Declaring whether each column may contain missing values with a boolean vector
julia> using uCSV, DataFrames
julia> s =
"""
1,hey,1
2,you,2
3,,3
4,"",4
5,NULL,5
6,NA,6
""";
julia> encodings = Dict("" => missing, "\"\"" => missing, "NULL" => missing, "NA" => missing);
julia> DataFrame(uCSV.read(IOBuffer(s), encodings=encodings, allowmissing=[false, true, false]))
6×3 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │
│ │ Int64 │ String⍰ │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1 │ 1 │ hey │ 1 │
│ 2 │ 2 │ you │ 2 │
│ 3 │ 3 │ missing │ 3 │
│ 4 │ 4 │ missing │ 4 │
│ 5 │ 5 │ missing │ 5 │
│ 6 │ 6 │ missing │ 6 │
Declaring that a subset of columns may contain missing values with a dictionary (keys are column indices)
julia> using uCSV, DataFrames
julia> s =
"""
1,hey,1
2,you,2
3,,3
4,"",4
5,NULL,5
6,NA,6
""";
julia> encodings = Dict("" => missing, "\"\"" => missing, "NULL" => missing, "NA" => missing);
julia> DataFrame(uCSV.read(IOBuffer(s), encodings=encodings, allowmissing=Dict(2 => true)))
6×3 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │
│ │ Int64 │ String⍰ │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1 │ 1 │ hey │ 1 │
│ 2 │ 2 │ you │ 2 │
│ 3 │ 3 │ missing │ 3 │
│ 4 │ 4 │ missing │ 4 │
│ 5 │ 5 │ missing │ 5 │
│ 6 │ 6 │ missing │ 6 │
Declaring that a subset of columns may contain missing values with a dictionary (keys are column names)
julia> using uCSV, DataFrames
julia> s =
"""
a,b,c
1,hey,1
2,you,2
3,,3
4,"",4
5,NULL,5
6,NA,6
""";
julia> encodings = Dict("" => missing, "\"\"" => missing, "NULL" => missing, "NA" => missing);
julia> DataFrame(uCSV.read(IOBuffer(s), encodings=encodings, header=1, allowmissing=Dict("b" => true)))
6×3 DataFrames.DataFrame
│ Row │ a │ b │ c │
│ │ Int64 │ String⍰ │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1 │ 1 │ hey │ 1 │
│ 2 │ 2 │ you │ 2 │
│ 3 │ 3 │ missing │ 3 │
│ 4 │ 4 │ missing │ 4 │
│ 5 │ 5 │ missing │ 5 │
│ 6 │ 6 │ missing │ 6 │
Declaring that a subset of columns may contain missing values by specifying the element type
julia> using uCSV, DataFrames
julia> s =
"""
1,hey,1
2,you,2
3,,3
4,"",4
5,NULL,5
6,NA,6
""";
julia> encodings = Dict("" => missing, "\"\"" => missing, "NULL" => missing, "NA" => missing);
julia> DataFrame(uCSV.read(IOBuffer(s), encodings=encodings, types=Dict(2 => Union{String, Missing})))
6×3 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │
│ │ Int64 │ String⍰ │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1 │ 1 │ hey │ 1 │
│ 2 │ 2 │ you │ 2 │
│ 3 │ 3 │ missing │ 3 │
│ 4 │ 4 │ missing │ 4 │
│ 5 │ 5 │ missing │ 5 │
│ 6 │ 6 │ missing │ 6 │
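Whichever argument is used, the affected column ends up with element type Union{T, Missing}. The following sketch uses only Base Julia to show what that means for downstream code. The vector x2 is built by hand to mirror the second column of the examples above; it is not produced by uCSV here.

```julia
# Hand-built stand-in for the parsed second column (assumption: same values as above).
x2 = Union{String, Missing}["hey", "you", missing, missing, missing, missing]

eltype(x2)                               # Union{Missing, String}
n_missing = count(ismissing, x2)         # number of missing entries
present = collect(skipmissing(x2))       # only the non-missing values
filled = coalesce.(x2, "unknown")        # replace each missing with a placeholder string
```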