What is the best strategy for transforming unicode strings into filenames?

Posted by David Cowden on Programmers
Published on 2012-07-10T14:50:34Z Indexed on 2012/07/10 15:23 UTC

I have thousands of resources in an RDF/XML file. I am writing a subset of them to files, one file per resource, and I'm using each resource's title property as the file name. However, the titles are everyday article, website, and blog post titles, so they contain characters that are unsafe in a URI (building one is a necessary step in constructing a valid file path). I know of the Jersey UriBuilder, but I can't quite get it to work for my needs, as I detailed in a different question on SO.

Some possibilities I have considered are:

  • Since each resource should also have an associated URL, I could use the name of the file on the server. The downside is that people don't always name their content logically, and I think the title of an article better reflects the content that will be in each text file.
  • Construct a whitelist of valid characters and parse the string myself, defining substitutions for unsafe characters. The downside is that the result could be just as unreadable as with the previous option, because the content creators presumably went through a similar process when placing the files on their server.
  • Choose a more generic naming scheme, place the title in the text file along with the other attributes, and tell my boss to live with it.
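
To make the second option concrete, here is a minimal sketch of the whitelist approach, written in Python for brevity. The function name `title_to_filename`, the whitelist itself (ASCII letters, digits, `.`, `_`, `-`), the length limit, and the `"untitled"` fallback are all assumptions chosen for illustration, not something taken from the RDF data or from Jersey. Note that titles written entirely in non-Latin scripts collapse to the fallback under this whitelist, and that different titles can map to the same name, so a real implementation would also need a uniqueness check.

```python
import re
import unicodedata


def title_to_filename(title: str, max_length: int = 100,
                      fallback: str = "untitled") -> str:
    """Hypothetical helper: reduce a Unicode title to a conservative file name."""
    # Decompose characters so that "é" becomes "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", title)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace every run of characters outside the whitelist with one hyphen.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "-", stripped)
    # Trim hyphens and dots from the ends so names like ".." cannot survive.
    safe = safe.strip("-.")
    # Enforce the length limit, then trim again in case the cut left a separator.
    safe = safe[:max_length].rstrip("-.")
    # Fall back to a placeholder if nothing usable is left.
    return safe or fallback


print(title_to_filename("Café & Crème: What's New?"))  # Cafe-Creme-What-s-New
```

Replacing runs of unsafe characters with a single separator, rather than deleting them, keeps word boundaries visible, which is most of what makes the result readable.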

So my question is: what methods work well when you need to construct file names out of strings with potentially unsafe characters? Is there a solution that better fits my constraints?

© Programmers or respective owner
