Documentation and code for manipulating sleep diaries
Many sleep diary programs export data using CSV (comma-separated values) files. See Wikipedia’s CSV page for a general discussion of the format. This page will discuss specific issues faced by programs that encode and decode CSV sleep diaries.
UTF-8 has become the most common way of encoding text in recent decades. But the CSV format is older than UTF-8, and there is no standard way to indicate the character encoding of a CSV file. Most sleep diary programs only support UTF-8 text and do not support any kind of marker to indicate the character encoding. But most users expect CSV files to work in Excel, which requires UTF-8 files to begin with a byte order mark. And very old programs may only support a rare character encoding.
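As a rough illustration of the encoding issues above, here is a minimal Python sketch. The function names `decode_csv_bytes` and `encode_csv_for_excel` are hypothetical, and the choice of Windows-1252 as the fallback encoding is an assumption rather than a recommendation from this project. It relies on Python's built-in `utf-8-sig` codec, which strips a leading byte order mark when one is present.

```python
import codecs


def decode_csv_bytes(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode CSV bytes, preferring UTF-8 and tolerating a byte order mark."""
    try:
        # "utf-8-sig" removes a leading BOM if present,
        # and otherwise behaves like plain UTF-8
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumption: files that are not valid UTF-8 use the fallback
        # encoding. Bytes the fallback cannot map are replaced with U+FFFD.
        return raw.decode(fallback, errors="replace")


def encode_csv_for_excel(text: str) -> bytes:
    """Encode CSV text as UTF-8 with a leading byte order mark."""
    return codecs.BOM_UTF8 + text.encode("utf-8")
```

For example, `decode_csv_bytes(encode_csv_for_excel("a,π"))` returns the original text without the byte order mark. A fallback like this can misread files in other legacy encodings, so it is only worth using when you know which programs produce your input.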
Here are some specific steps you should take to ensure your program handles UTF-8 well:
- A program that encodes π as 0xCF and 0x80 generates UTF-8
- A program that encodes the Euro sign (€) as 0xE2, 0x82, then 0xAC uses UTF-8
- A program that encodes the Euro sign (€) as 0xA4 uses Latin-9
- A program that encodes the Euro sign (€) as 0x80 uses Windows-1252
- A program that encodes the pound sign (£) as 0xC2 and 0xA3 uses UTF-8
- A program that encodes the pound sign (£) as 0xA3 and can encode the Euro sign (€) uses either Latin-9 or Windows-1252
- A program that encodes the pound sign (£) as 0xA3 but cannot encode the Euro sign (€) uses Latin-1
CSV is a loosely-defined format, having existed for over 30 years before the official specification was written. This mainly shows up in the way special characters are encoded in text fields. Specifically, developers generally struggle with how to encode a literal comma, a literal newline and a literal double quote. The specification says to put strings containing special characters between double quotes, to encode commas and newlines literally, and to encode each double quote as a pair of double quotes. For example:
"this is a single field containing one comma (,) one newline (
) and one double quote ("")"
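To check how a specification-compliant parser treats the field above, here is a short sketch using Python's standard `csv` module, whose default dialect quotes fields and doubles quotes in the same way. The variable names are illustrative only.

```python
import csv
import io

# The example field, as it would appear inside a CSV file
encoded = (
    '"this is a single field containing one comma (,) one newline (\n'
    ') and one double quote ("")"\r\n'
)

# newline="" stops Python from translating line endings, so the csv
# module sees the newline inside the quoted field unchanged
rows = list(csv.reader(io.StringIO(encoded, newline="")))
field = rows[0][0]

# Writing the decoded value back out reproduces the quoted form
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow([field])
round_trip = buffer.getvalue()
```

The reader returns one row containing one field, with the comma and newline kept as literal characters and the doubled quote collapsed to a single double quote. The writer then produces the same text as `encoded`.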
In practice, programs often handle these cases poorly. For example, a program might assume that commas and newlines always indicate field and line boundaries, might encode newlines as \n, or might encode quotes as \". Programs that use backslashes to escape characters don’t always escape literal backslashes, so \n could indicate a newline or a literal backslash followed by a literal n.
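For input from a program known to use backslash escapes, one possible approach is sketched below. The function name `unescape_backslashes` is hypothetical, and the sketch assumes the producing program does escape literal backslashes as a doubled backslash. If it does not, the ambiguity described above remains and no decoder can resolve it reliably.

```python
import re

# Escape sequences this sketch understands; anything else is left alone
_ESCAPES = {"n": "\n", '"': '"', "\\": "\\"}


def unescape_backslashes(text: str) -> str:
    """Decode \\n, \\" and \\\\ escapes, scanning left to right."""

    def replace(match: re.Match) -> str:
        # Unknown escapes are kept verbatim rather than guessed at
        return _ESCAPES.get(match.group(1), match.group(0))

    # Each match consumes a backslash plus the character after it, so a
    # doubled backslash is decoded before any following "n" is considered
    return re.sub(r"\\(.)", replace, text)
```

Scanning left to right means a doubled backslash followed by n decodes to a literal backslash and a literal n, while a single backslash followed by n decodes to a newline.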
Here are some specific steps you should take to ensure your program handles text fields well:
Once you understand the programs you interact with, you will have to decide how much work you want to put into processing difficult cases. It is usually possible to construct some data that cannot be processed unambiguously, but such cases rarely occur in the real world. The recommended solutions in this project generally try to handle as many realistic cases as possible without making the code difficult to maintain.