uCSV.jl Documentation
Functions
uCSV.read
— Function.

```julia
read(input;
     delim=',',
     quotes=missing,
     escape=missing,
     comment=missing,
     encodings=Dict{String, Any}(),
     header=0,
     skiprows=Vector{Int}(),
     types=Dict{Int, DataType}(),
     allowmissing=Dict{Int, Bool}(),
     coltypes=Vector,
     colparsers=Dict{Int, Function}(),
     typeparsers=Dict{DataType, Function}(),
     typedetectrows=1,
     skipmalformed=false,
     trimwhitespace=false)
```
Take an input file or IO source and user-defined parsing rules and return:

- a `Vector{Any}` containing the parsed columns
- a `Vector{String}` containing the header (column names)
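A minimal sketch of a call on an in-memory IO source (a file path works the same way); the commented results assume default type detection:

```julia
using uCSV

# parse two Int columns, taking column names from the first line
data, header = uCSV.read(IOBuffer("a,b\n1,2\n3,4"), header=1)
# data   => Any[[1, 3], [2, 4]]
# header => ["a", "b"]
```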
Arguments
`input`

- the path to a local file, or an open IO source from which to read data
`delim`

- a `Char` or `String` that separates fields in the dataset
- default: `delim=','`
  - for CSV files
- frequently used:
  - `delim='\t'`
  - `delim=' '`
  - `delim='|'`
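For example, reading tab-separated data:

```julia
using uCSV

# same call as for CSV, with the field separator overridden
data, header = uCSV.read(IOBuffer("a\tb\n1\t2"), delim='\t', header=1)
```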
`quotes`

- a `Char` used for quoting fields in the dataset
- default: `quotes=missing`
  - by default, the parser does not check for quotes
- frequently used:
  - `quotes='"'`
`escape`

- a `Char` used for escaping other reserved parsing characters
- default: `escape=missing`
  - by default, the parser does not check for escapes
- frequently used:
  - `escape='"'`
    - double-quotes within quotes, e.g. `"firstname ""nickname"" lastname"`
  - `escape='\\'`
    - backslash-escaped quotes, e.g. `"firstname \"nickname\" lastname"`
    - note that the first backslash is just to escape the second backslash
`comment`

- a `Char` or `String` at the beginning of lines that should be skipped as comments
  - note that skipped comment lines do not contribute to the line count for the header (if the user requests parsing a header on a specific row) or for `skiprows`
- default: `comment=missing`
  - by default, the parser does not check for comments
- frequently used:
  - `comment='#'`
  - `comment='!'`
  - `comment="#!"`
`encodings`

- a `Dict{String, Any}` mapping parsed fields to Julia values
  - if your dataset has booleans that are not represented as `"true"` and `"false"`, or missing values that you'd like to read as `missing`s, you'll need to use this!
- default: `encodings=Dict{String, Any}()`
  - by default, the parser does not check for any reserved fields
- frequently used:
  - `encodings=Dict("" => missing)`
  - `encodings=Dict("NA" => missing)`
  - `encodings=Dict("N/A" => missing)`
  - `encodings=Dict("NULL" => missing)`
  - `encodings=Dict("TRUE" => true, "FALSE" => false)`
  - `encodings=Dict("True" => true, "False" => false)`
  - `encodings=Dict("T" => true, "F" => false)`
  - `encodings=Dict("yes" => true, "no" => false)`
  - ... your encodings here ...
- can include any number of `String` => value mappings
- note that if the user requests `quotes`, `escape`, or `trimwhitespace`, these requests will be applied to (removed from) the raw string BEFORE checking whether the field matches any strings in the `encodings` argument
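For example, a sketch that reads `"NA"` fields as `missing` (`typedetectrows=2` is used so type detection also sees a non-missing value in column 2):

```julia
using uCSV

s = "1,NA\n2,7"
data, header = uCSV.read(IOBuffer(s),
                         encodings=Dict("NA" => missing),
                         typedetectrows=2)
# data[2] => Union{Missing, Int64}[missing, 7]
```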
`header`

- an `Int` indicating which line of the dataset contains column names, or a `Vector{String}` of column names
  - note that commented lines and blank lines do not contribute to this value, e.g. if the first 3 lines of your dataset are comments, you'll still need to set `header=1` to interpret the first line of parsed data as the header
- default: `header=0`
  - no header is checked for by default
- frequently used:
  - `header=1`
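For example, supplying column names directly when the file has none:

```julia
using uCSV

data, header = uCSV.read(IOBuffer("1,2\n3,4"), header=["x", "y"])
# header => ["x", "y"]
```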
`skiprows`

- a `Range` or `Vector` of `Int`s indicating which rows to skip in the dataset
  - note that this is 1-based in reference to the first row AFTER the header. if `header=0` or the header is provided by the user, this will be the first non-empty line in the dataset. otherwise `skiprows=1:1` will skip the `header+1`-nth line in the file
- default: `skiprows=Vector{Int}()`
  - no rows are skipped
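For example, dropping the first row after the header:

```julia
using uCSV

s = "a,b\nskip,me\n1,2"
# skiprows counts from the first row after the header
data, header = uCSV.read(IOBuffer(s), header=1, skiprows=1:1)
# data => Any[[1], [2]]
```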
`types`

- declare the types of the columns
  - scalar, e.g. `types=Bool`
    - scalars will be broadcast to apply to every column of the dataset
  - vector, e.g. `types=[Bool, Int, Float64, String, Symbol, Date, DateTime]`
    - the vector length must match the number of parsed columns
  - dictionary, e.g. `types=Dict("column1" => Bool)` or `types=Dict(1 => Union{Int, Missing})`
    - users can refer to the columns by name (only if a header is provided or parsed!) or by index
- default: `types=Dict{Int, DataType}()`
  - column-types will be interpreted from the dataset
  - built-in support for parsing the following:
    - `Int`
    - `Float64`
    - `String`
    - `Symbol`
    - `Date` – only the default date format will work
    - `DateTime` – only the default datetime format will work
  - for other types or unsupported formats, see `colparsers` and `typeparsers`
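For example, forcing a column that would otherwise be detected as `Int` to parse as `Float64`:

```julia
using uCSV

data, header = uCSV.read(IOBuffer("1,x\n2,y"), types=Dict(1 => Float64))
# data[1] => [1.0, 2.0]
```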
`allowmissing`

- declare whether columns should have element-type `Union{T, Missing} where T`
  - boolean scalar, e.g. `allowmissing=true`
    - scalars will be broadcast to apply to every column of the dataset
  - vector, e.g. `allowmissing=[true, false, true, true]`
    - the vector length must match the number of parsed columns
  - dictionary, e.g. `allowmissing=Dict("column1" => true)` or `allowmissing=Dict(17 => true)`
    - users can refer to the columns by name (only if a header is provided or parsed!) or by index
- default: `allowmissing=Dict{Int, Bool}()`
  - allowing missing values is determined by type detection in rows `1:typedetectrows`
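For example, widening a column's element type even though no `missing`s appear in the type-detection rows:

```julia
using uCSV

data, header = uCSV.read(IOBuffer("1,x\n2,y"),
                         allowmissing=Dict(1 => true))
# eltype(data[1]) => Union{Missing, Int64}
```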
`coltypes`

- declare the type of vector that should be used for columns
  - should work for any `AbstractVector` that allows `push!`ing values
  - scalar, e.g. `coltypes=CategoricalVector`
    - scalars will be broadcast to apply to every column of the dataset
  - vector, e.g. `coltypes=[CategoricalVector, Vector, CategoricalVector]`
    - the vector length must match the number of parsed columns
  - dictionary, e.g. `coltypes=Dict("column1" => CategoricalVector)` or `coltypes=Dict(17 => CategoricalVector)`
    - users can refer to the columns by name (only if a header is provided or parsed!) or by index
- default: `coltypes=Vector`
  - all columns are returned as standard Julia `Vector`s
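For example, a sketch returning a column as a `CategoricalVector` (assumes the CategoricalArrays package is installed):

```julia
using uCSV
using CategoricalArrays  # provides CategoricalVector

data, header = uCSV.read(IOBuffer("a\nb\na"), coltypes=CategoricalVector)
# data[1] isa CategoricalVector
```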
`colparsers`

- provide custom functions for converting parsed strings to values by column
  - scalar, e.g. `colparsers=(x -> parse(Float64, replace(x, ',' => '.')))`
    - scalars will be broadcast to apply to every column of the dataset
  - vector, e.g. `colparsers=[x -> mydateparser(x), x -> mytimeparser(x)]`
    - the vector length must match the number of parsed columns
  - dictionary, e.g. `colparsers=Dict("column1" => x -> mydateparser(x))`
    - users can refer to the columns by name (only if a header is provided or parsed!) or by index
- default: `colparsers=Dict{Int, Function}()`
  - column parsers are determined based on user-specified types and those detected from the data
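For example, parsing a non-default date format with a custom parser for column 1 (`mydateparser` is a hypothetical helper defined here):

```julia
using uCSV
using Dates

# hypothetical parser for a dd/mm/yyyy column
mydateparser(x) = Date(x, dateformat"dd/mm/yyyy")

data, header = uCSV.read(IOBuffer("01/02/2003\n04/05/2006"),
                         colparsers=Dict(1 => mydateparser))
# data[1] => [Date(2003, 2, 1), Date(2006, 5, 4)]
```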
`typeparsers`

- provide custom functions for converting parsed strings to values by column type
  - NOTE: must be used with `types`. If you supply a custom `Int` parser you'd like to use to parse column 6, you'll need to set `types=Dict(6 => Int)` for it to work
- default: `typeparsers=Dict{DataType, Function}()`
  - column parsers are determined based on user-specified types and those detected from the data
- frequently used:
  - `typeparsers=Dict(Float64 => x -> parse(Float64, replace(x, ',' => '.')))` # decimal-comma floats!
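For example, a sketch parsing decimal-comma floats from a semicolon-delimited source:

```julia
using uCSV

s = "1,5;2,5\n3,0;4,5"
data, header = uCSV.read(IOBuffer(s), delim=';',
                         types=Dict(1 => Float64, 2 => Float64),
                         typeparsers=Dict(Float64 => x -> parse(Float64, replace(x, ',' => '.'))))
# data => Any[[1.5, 3.0], [2.5, 4.5]]
```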
`typedetectrows`

- specify how many rows of data to read before interpreting the values that each column should take on
- default: `typedetectrows=1`
- must be >= 1
- commented, skipped, and empty lines are not counted when determining which rows are used for type detection, e.g. setting `typedetectrows=10` and `skiprows=1:5` means type detection will occur on rows `6:15`
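For example, inspecting more rows so a value that first appears later informs the column type:

```julia
using uCSV

s = "1,\n2,\n3,4"
# with typedetectrows=1 the empty field in row 1 carries no type
# information for column 2; scanning 3 rows reveals an Int
data, header = uCSV.read(IOBuffer(s),
                         encodings=Dict("" => missing),
                         typedetectrows=3)
# eltype(data[2]) => Union{Missing, Int64}
```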
`skipmalformed`

- specify whether the parser should skip a line or fail with an error if the line is parsed but does not contain the expected number of fields
- default: `skipmalformed=false`
  - malformed lines result in an error
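For example, tolerating a short line instead of erroring:

```julia
using uCSV

s = "1,2\n3\n4,5"
# the second line has only one field and is skipped
# rather than raising an error
data, header = uCSV.read(IOBuffer(s), skipmalformed=true)
# data => Any[[1, 4], [2, 5]]
```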
`trimwhitespace`

- specify whether extra whitespace should be removed from the beginning and end of fields
  - e.g. given `...., myfield ,....`
    - `trimwhitespace=false` -> `" myfield "`
    - `trimwhitespace=true` -> `"myfield"`
- leading and trailing whitespace OUTSIDE of quoted fields is trimmed by default
  - e.g. `...., " myfield " ,....` -> `" myfield "` when `quotes='"'`
- `trimwhitespace=true` will also trim leading and trailing whitespace WITHIN quotes
- default: `trimwhitespace=false`
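For example:

```julia
using uCSV

data, header = uCSV.read(IOBuffer("1, x \n2, y "), trimwhitespace=true)
# data[2] => ["x", "y"] rather than [" x ", " y "]
```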
uCSV.write
— Function.

```julia
write(output;
      header=missing,
      data=missing,
      delim=',',
      quotes=missing,
      quotetypes=AbstractString)
```
Write a dataset to disk or IO
Arguments
`output`

- the path on disk or the IO sink to write to

`header`

- the column names for the data written to `output`
- default: `header=missing`
  - no header is written

`data`

- the dataset to write to `output`
- default: `data=missing`
  - no data is written

`delim`

- the delimiter to separate fields by
- default: `delim=','`
  - for CSV files
- frequently used:
  - `delim='\t'`
  - `delim=' '`
  - `delim='|'`
`quotes`

- the quoting character to use when writing fields
- default: `quotes=missing`
  - fields are not quoted by default, and are written using Julia's default string-printing mechanisms
`quotetypes::Type`

- when quoting fields, quote only columns where `coltype <: quotetypes`
  - columns of type `Union{<:quotetypes, Missing}` will also be quoted
- default: `quotetypes=AbstractString`
  - only the header and fields where `coltype <: AbstractString` will be quoted
- frequently used:
  - `quotetypes=Any`
    - quote every field in the dataset
```julia
write(output,
      df;
      delim=',',
      quotes=missing,
      quotetypes=AbstractString)
```

Write a DataFrame to disk or IO
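A minimal sketch writing columns to an in-memory buffer; `data` is a vector of columns, matching the shape returned by `uCSV.read`:

```julia
using uCSV

io = IOBuffer()
# header names the columns; data holds one vector per column
uCSV.write(io, header=["a", "b"], data=[[1, 2], [3, 4]])
print(String(take!(io)))
# a,b
# 1,3
# 2,4
```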
uCSV.tomatrix
— Function. Convert the data output by `uCSV.read` to a `Matrix`. Column names are ignored.
uCSV.tovector
— Function. Convert the data output by `uCSV.read` to a `Vector`. Column names are ignored.
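Both converters operate on the output of `uCSV.read`; a sketch, assuming they accept the `(columns, header)` tuple returned by `read` directly:

```julia
using uCSV

# Matrix of the parsed values (the header, if any, is ignored)
m = uCSV.tomatrix(uCSV.read(IOBuffer("1,2\n3,4")))
# Vector of the parsed values (the header, if any, is ignored)
v = uCSV.tovector(uCSV.read(IOBuffer("1,2\n3,4")))
```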
Manual
- Getting Started
- Headers
- Reading into DataFrames
- Delimiters
- Missing Data
- Declaring Column Element Types
- Declaring Column Vector Types
- International Representations for Numbers
- Custom Parsers
- Quotes and Escapes
- Skipping Comments and Rows
- Malformed Data
- Reading Data from URLs
- Reading Compressed Datasets
- Common Formatting Issues
- Writing Data
- Benchmarks