
Sure. But this is a pet peeve of mine: strings are text. Bytes (and thus encodings) are something you should only be concerned about when doing file or network IO. It's boundary stuff, and none of your actual text processing should depend on it.

Consider the phrasing my way of shielding you from accusations about being entirely wrong ;)



> It's boundary-stuff, and none of your actual text-processing should depend on it.

I strongly disagree. For high-performance text search, it's critical that you deal with the text's in-memory representation explicitly. This violates your maxim that such things are only handled at the boundaries.

For example, if you're implementing substring search, the techniques you use will heavily depend on how your string is represented in memory. Is it UTF-16? UTF-8? A sequence of codepoints? A sequence of grapheme clusters, where each cluster is a sequence of codepoints? Each of these choices will require different substring search strategies if you care about squeezing the most juice out of the underlying hardware.
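To make that concrete, here's a minimal sketch (Python, example strings mine) showing that the "same" search returns different offsets depending on whether you search codepoints or the UTF-8 byte representation, which is why the search strategy can't be representation-agnostic:

```python
# The same substring search gives different answers depending on the
# in-memory representation: codepoint-oriented vs. byte-oriented.
text = "naïve café"
needle = "café"

# Python's str is codepoint-oriented; index() returns a codepoint offset.
cp_index = text.index(needle)

# Over the UTF-8 encoding, "ï" occupies two bytes, so the byte offset
# of the same match is shifted.
data = text.encode("utf-8")
byte_index = data.index(needle.encode("utf-8"))

print(cp_index, byte_index)  # 6 7
```

A byte-oriented searcher can use fast byte-level tricks (memchr, SIMD) directly on UTF-8, while a codepoint- or cluster-oriented one has to decode or segment first; that's the performance trade-off being described.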


Sometimes I think we would be better off if our languages didn't have string as a data type at all.

The text encodings themselves (e.g. UTF-8, UTF-32) ought to be proper data types. Strings are a leaky abstraction that cause otherwise competent programmers to have funny ideas about what text is and isn't, as this entire thread demonstrates.
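As a rough sketch of what "encodings as proper data types" could look like (the class names here are hypothetical, not any existing library's API): wrap each encoding's bytes in its own type, so that mixing two encodings is an API error rather than silent byte soup.

```python
# Hypothetical sketch: one type per encoding, instead of one "string" type.
from dataclasses import dataclass

@dataclass(frozen=True)
class Utf8Text:
    data: bytes
    def decode(self) -> str:
        return self.data.decode("utf-8")

@dataclass(frozen=True)
class Utf32Text:
    data: bytes
    def decode(self) -> str:
        return self.data.decode("utf-32-le")

s = "héllo"
a = Utf8Text(s.encode("utf-8"))
b = Utf32Text(s.encode("utf-32-le"))

# Same text, different byte representations -- comparing a.data to b.data
# directly would be a bug; the types make the distinction visible.
assert a.decode() == b.decode() == s
assert a.data != b.data
```

Python's own str/bytes split is a half-step in this direction: bytes carries no encoding, so the programmer still has to track which encoding a given bytes value is in.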


>strings are text

I do not completely agree with that. Text is what strings are mostly used for, but I wouldn't call a base64-encoded image text. Text is something a person can read and make sense of.

So I think it would be more accurate to say that a string is a sequence of characters (grapheme clusters in unicode speak).
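The codepoint/cluster distinction can be shown with a quick sketch ("é" chosen as an example): the same user-perceived character can be one codepoint or two, so counting codepoints doesn't count characters.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single precomposed codepoint
decomposed = "e\u0301"   # "e" + combining acute: two codepoints, one grapheme cluster

print(len(composed), len(decomposed))  # 1 2 -- len() counts codepoints, not clusters

# A rough grapheme-cluster count (assumption: a new cluster starts at each
# non-combining codepoint; the full rules are in Unicode's UAX #29 and are
# more involved than this).
def rough_grapheme_count(s: str) -> int:
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

assert rough_grapheme_count(composed) == rough_grapheme_count(decomposed) == 1
```

So "sequence of grapheme clusters" and "sequence of codepoints" really are different data types, which is the point being argued above.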


Or memory "IO", which standard libraries are often bad at abstracting away.



