# uCSV.jl Documentation

## Functions
### uCSV.read — Function

```julia
read(input;
     delim=',',
     quotes=missing,
     escape=missing,
     comment=missing,
     encodings=Dict{String, Any}(),
     header=0,
     skiprows=Vector{Int}(),
     types=Dict{Int,DataType}(),
     allowmissing=Dict{Int,Bool}(),
     coltypes=Vector,
     colparsers=Dict{Int,Function}(),
     typeparsers=Dict{DataType, Function}(),
     typedetectrows=1,
     skipmalformed=false,
     trimwhitespace=false)
```

Take an input file or IO source and user-defined parsing rules and return:

- a `Vector{Any}` containing the parsed columns
- a `Vector{String}` containing the header (column names)
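As a quick illustration of the two return values, here is a minimal sketch (assuming uCSV is installed) that reads from an in-memory IO source; the sample data is invented for illustration:

```julia
using uCSV

# parse a small CSV from an in-memory buffer; row 1 holds the column names
data, header = uCSV.read(IOBuffer("name,age\nAlice,30\nBob,25"), header=1)

header   # the column names as a Vector{String}
data[1]  # first column, detected as String
data[2]  # second column, detected as Int
```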
**Arguments**
- `input`: the path to a local file, or an open IO source from which to read data
- `delim`: a `Char` or `String` that separates fields in the dataset
  - default: `delim=','` (for CSV files)
  - frequently used: `delim='\t'`, `delim=' '`, `delim='|'`
- `quotes`: a `Char` used for quoting fields in the dataset
  - default: `quotes=missing` (by default, the parser does not check for quotes)
  - frequently used: `quotes='"'`
- `escape`: a `Char` used for escaping other reserved parsing characters
  - default: `escape=missing` (by default, the parser does not check for escapes)
  - frequently used:
    - `escape='"'`: double-quotes within quotes, e.g. `"firstname ""nickname"" lastname"`
    - `escape='\\'`: e.g. `"firstname \"nickname\" lastname"` (note that the first backslash is just to escape the second backslash)
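A sketch of both quoting and quote-escaping in action; the sample strings are invented for illustration:

```julia
using uCSV

# a comma inside a quoted field is not treated as a delimiter
s = "speaker,line\nHamlet,\"To be, or not to be\"\n"
data, header = uCSV.read(IOBuffer(s), quotes='"', header=1)

# doubled quotes inside a quoted field, unescaped via escape='"'
s2 = "nickname\n\"Dwayne \"\"The Rock\"\" Johnson\"\n"
data2, _ = uCSV.read(IOBuffer(s2), quotes='"', escape='"', header=1)
```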
- `comment`: a `Char` or `String` at the beginning of lines that should be skipped as comments
  - note that skipped comment lines do not contribute to the line count for the header (if the user requests parsing a header on a specific row) or for `skiprows`
  - default: `comment=missing` (by default, the parser does not check for comments)
  - frequently used: `comment='#'`, `comment='!'`, `comment="#!"`
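Because commented lines do not count toward the header line number, a file with a leading comment still uses `header=1`. A sketch with invented sample data:

```julia
using uCSV

s = "# file metadata\na,b\n1,2\n"
data, header = uCSV.read(IOBuffer(s), comment='#', header=1)

header   # the comment line is skipped, so line "a,b" is the header
data[1]  # [1]
```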
- `encodings`: a `Dict{String, Any}` mapping parsed fields to Julia values
  - if your dataset has booleans that are not represented as `"true"` and `"false"`, or missing values that you'd like to read as `missing`s, you'll need to use this!
  - default: `encodings=Dict{String, Any}()` (by default, the parser does not check for any reserved fields)
  - frequently used: `encodings=Dict("" => missing)`, `encodings=Dict("NA" => missing)`, `encodings=Dict("N/A" => missing)`, `encodings=Dict("NULL" => missing)`, `encodings=Dict("TRUE" => true, "FALSE" => false)`, `encodings=Dict("True" => true, "False" => false)`, `encodings=Dict("T" => true, "F" => false)`, `encodings=Dict("yes" => true, "no" => false)`, ... your encodings here ...
  - can include any number of `String` => value mappings
  - note that if the user requests `quotes`, `escapes`, or `trimwhitespace`, these will be applied to (removed from) the raw string BEFORE checking whether the field matches any strings in the `encodings` argument
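A sketch mapping `"NA"` to `missing`; `typedetectrows=2` is used here so that the missing value in the first data row is seen during type detection (sample data invented for illustration):

```julia
using uCSV

s = "a,b\n1,NA\n2,3"
data, header = uCSV.read(IOBuffer(s),
                         header=1,
                         encodings=Dict{String, Any}("NA" => missing),
                         typedetectrows=2)

data[2]  # second column holds a missing value followed by an Int
```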
- `header`: an `Int` indicating which line of the dataset contains column names, or a `Vector{String}` of column names
  - note that commented lines and blank lines do not contribute to this value, e.g. if the first 3 lines of your dataset are comments, you'll still need to set `header=1` to interpret the first line of parsed data as the header
  - default: `header=0` (no header is checked for by default)
  - frequently used: `header=1`
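Column names can also be supplied directly when the file has no header row; a sketch with invented sample data:

```julia
using uCSV

# the file has no header line, so names are passed as a Vector{String}
data, header = uCSV.read(IOBuffer("1,2\n3,4"), header=["x", "y"])

header   # ["x", "y"]
data[1]  # [1, 3]
```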
- `skiprows`: a `Range` or `Vector` of `Int`s indicating which rows to skip in the dataset
  - note that this is 1-based in reference to the first row AFTER the header; if `header=0` or the header is provided by the user, this will be the first non-empty line in the dataset, otherwise `skiprows=1:1` will skip the `header+1`-th line in the file
  - default: `skiprows=Vector{Int}()` (no rows are skipped)
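Because `skiprows` is 1-based relative to the first row after the header, `skiprows=1:1` below drops the first data row rather than the header line (a sketch with invented sample data):

```julia
using uCSV

s = "a,b\n1,2\n3,4\n5,6"
data, header = uCSV.read(IOBuffer(s), header=1, skiprows=1:1)

data[1]  # the row "1,2" was skipped
```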
- `types`: declare the types of the columns
  - scalar, e.g. `types=Bool`: scalars will be broadcast to apply to every column of the dataset
  - vector, e.g. `types=[Bool, Int, Float64, String, Symbol, Date, DateTime]`: the vector length must match the number of parsed columns
  - dictionary, e.g. `types=Dict("column1" => Bool)` or `types=Dict(1 => Union{Int, Missing})`: users can refer to the columns by name (only if a header is provided or parsed!) or by index
  - default: `types=Dict{Int,DataType}()` (column types will be interpreted from the dataset)
  - built-in support for parsing `Int`, `Float64`, `String`, `Symbol`, `Date` (only the default date format will work), and `DateTime` (only the default datetime format will work); for other types or unsupported formats, see `colparsers` and `typeparsers`
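A sketch forcing one column to `Float64` by name, which is possible here because a header is parsed (sample data invented for illustration):

```julia
using uCSV

data, header = uCSV.read(IOBuffer("a,b\n1,2\n3,4"),
                         header=1,
                         types=Dict("b" => Float64))

data[2]  # column "b" is parsed as Float64 instead of the detected Int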
- `allowmissing`: declare whether columns should have element-type `Union{T, Missing} where T`
  - boolean scalar, e.g. `allowmissing=true`: scalars will be broadcast to apply to every column of the dataset
  - vector, e.g. `allowmissing=[true, false, true, true]`: the vector length must match the number of parsed columns
  - dictionary, e.g. `allowmissing=Dict("column1" => true)` or `allowmissing=Dict(17 => true)`: users can refer to the columns by name (only if a header is provided or parsed!) or by index
  - default: `allowmissing=Dict{Int,Bool}()` (allowing missing values is determined by type detection in rows `1:typedetectrows`)
- `coltypes`: declare the type of vector that should be used for columns
  - should work for any `AbstractVector` that allows `push!`ing values
  - scalar, e.g. `coltypes=CategoricalVector`: scalars will be broadcast to apply to every column of the dataset
  - vector, e.g. `coltypes=[CategoricalVector, Vector, CategoricalVector]`: the vector length must match the number of parsed columns
  - dictionary, e.g. `coltypes=Dict("column1" => CategoricalVector)` or `coltypes=Dict(17 => CategoricalVector)`: users can refer to the columns by name (only if a header is provided or parsed!) or by index
  - default: `coltypes=Vector` (all columns are returned as standard Julia `Vector`s)
- `colparsers`: provide custom functions for converting parsed strings to values by column
  - scalar, e.g. `colparsers=(x -> parse(Float64, replace(x, ',' => '.')))`: scalars will be broadcast to apply to every column of the dataset
  - vector, e.g. `colparsers=[x -> mydateparser(x), x -> mytimeparser(x)]`: the vector length must match the number of parsed columns
  - dictionary, e.g. `colparsers=Dict("column1" => x -> mydateparser(x))`: users can refer to the columns by name (only if a header is provided or parsed!) or by index
  - default: `colparsers=Dict{Int,Function}()` (column parsers are determined based on user-specified types and those detected from the data)
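A sketch with a custom per-column parser; the currency-style format here is invented for illustration:

```julia
using uCSV

s = "item,price\nwidget,\$1.50\ngadget,\$2.25"
data, header = uCSV.read(IOBuffer(s),
                         header=1,
                         # strip the leading "$" before parsing column 2 as Float64
                         colparsers=Dict(2 => x -> parse(Float64, replace(x, "\$" => ""))))

data[2]  # [1.5, 2.25]
```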
- `typeparsers`: provide custom functions for converting parsed strings to values by column type
  - NOTE: must be used with `types`. If you supply a custom `Int` parser you'd like to use to parse column 6, you'll need to set `types=Dict(6 => Int)` for it to work
  - default: `typeparsers=Dict{DataType, Function}()` (column parsers are determined based on user-specified types and those detected from the data)
  - frequently used: `typeparsers=Dict(Float64 => x -> parse(Float64, replace(x, ',' => '.')))` (decimal-comma floats!)
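A sketch parsing decimal-comma floats from a semicolon-delimited source; note that column 1 is declared `Float64` so the type parser is consulted for it (sample data invented for illustration):

```julia
using uCSV

s = "a;b\n1,5;2\n3,25;4"
data, header = uCSV.read(IOBuffer(s),
                         delim=';',
                         header=1,
                         types=Dict(1 => Float64),
                         # "1,5" -> "1.5" -> 1.5
                         typeparsers=Dict(Float64 => x -> parse(Float64, replace(x, ',' => '.'))))

data[1]  # [1.5, 3.25]
```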
- `typedetectrows`: specify how many rows of data to read before interpreting the values that each column should take on
  - default: `typedetectrows=1` (must be >= 1)
  - commented, skipped, and empty lines are not counted when determining which rows are used for type detection, e.g. setting `typedetectrows=10` and `skiprows=1:5` means type detection will occur on rows `6:15`
- `skipmalformed`: specify whether the parser should skip a line or fail with an error when a parsed line does not contain the expected number of fields
  - default: `skipmalformed=false` (malformed lines result in an error)
- `trimwhitespace`: specify whether extra whitespace should be removed from the beginning and end of fields
  - e.g. for `...., myfield ,....`: `trimwhitespace=false` -> `" myfield "`, `trimwhitespace=true` -> `"myfield"`
  - leading and trailing whitespace OUTSIDE of quoted fields is trimmed by default, e.g. `...., " myfield " ,....` -> `" myfield "` when `quotes='"'`
  - `trimwhitespace=true` will also trim leading and trailing whitespace WITHIN quotes
  - default: `trimwhitespace=false`
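A sketch contrasting the default behavior with `trimwhitespace=true` for whitespace inside quotes (sample data invented for illustration):

```julia
using uCSV

s = "a\n\" padded \"\n"
kept, _    = uCSV.read(IOBuffer(s), quotes='"', header=1)
trimmed, _ = uCSV.read(IOBuffer(s), quotes='"', header=1, trimwhitespace=true)

kept[1]     # whitespace within quotes is preserved by default
trimmed[1]  # trimwhitespace=true also trims within quotes
```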
### uCSV.write — Function

```julia
write(output;
      header=missing,
      data=missing,
      delim=',',
      quotes=missing,
      quotetypes=AbstractString)
```

Write a dataset to disk or IO.
**Arguments**
- `output`: the path on disk or IO where you want to write to
- `header`: the column names for the data to `output`
  - default: `header=missing` (no header is written)
- `data`: the dataset to write to `output`
  - default: `data=missing` (no data is written)
- `delim`: the delimiter to separate fields by
  - default: `delim=','` (for CSV files)
  - frequently used: `delim='\t'`, `delim=' '`, `delim='|'`
- `quotes`: the quoting character to use when writing fields
  - default: `quotes=missing` (fields are not quoted by default, and are written using Julia's default string-printing mechanisms)
- `quotetypes::Type`: when quoting fields, quote only columns where `coltype <: quotetypes`
  - columns of type `Union{<:quotetypes, Missing}` will also be quoted
  - default: `quotetypes=AbstractString` (only the header and fields where `coltype <: AbstractString` will be quoted)
  - frequently used: `quotetypes=Any` (quote every field in the dataset)
```julia
write(output,
      df;
      delim=',',
      quotes=missing,
      quotetypes=AbstractString)
```

Write a DataFrame to disk or IO.
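A sketch writing a small dataset (as a vector of columns) to an in-memory buffer; the column values are invented for illustration:

```julia
using uCSV

io = IOBuffer()
# data is a vector of column vectors, matching the structure uCSV.read returns
uCSV.write(io, header=["a", "b"], data=[[1, 2], [3.5, 4.5]])
out = String(take!(io))
```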
### uCSV.tomatrix — Function

Convert the data output by `uCSV.read` to a `Matrix`. Column names are ignored.
### uCSV.tovector — Function

Convert the data output by `uCSV.read` to a `Vector`. Column names are ignored.
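A sketch assuming these converters accept the `(data, header)` output of `uCSV.read` directly (sample data invented for illustration):

```julia
using uCSV

m = uCSV.tomatrix(uCSV.read(IOBuffer("a,b\n1,2\n3,4"), header=1))
v = uCSV.tovector(uCSV.read(IOBuffer("a\n1\n2"), header=1))

size(m)  # (2, 2); the header row is not part of the matrix
```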
## Manual
- Getting Started
- Headers
- Reading into DataFrames
- Delimiters
- Missing Data
- Declaring Column Element Types
- Declaring Column Vector Types
- International Representations for Numbers
- Custom Parsers
- Quotes and Escapes
- Skipping Comments and Rows
- Malformed Data
- Reading Data from URLs
- Reading Compressed Datasets
- Common Formatting Issues
- Writing Data
- Benchmarks