Search Results

Search found 1649 results on 66 pages for 'unicode normalization'.

Page 5/66 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >

  • Is there stl and utf8 friendly C++ Wrapper for ICU, or other powerful unicode library

    - by artyom
    Hello, I need a good Unicode library for C++. I need transformations done in a Unicode-sensitive way. For example: sort all strings case-insensitively and get their first characters for an index. Convert various Unicode strings to upper and lower case. Split text at reasonable positions (words) in a way that also works for Chinese and Japanese. Format numbers and dates in a locale-sensitive way (should be thread safe). Transparent support for UTF-8 (the primary internal representation). As far as I know, the best library is ICU. However, I can't find normal, developer-friendly API documentation with examples. Also, as far as I can see, it is not too friendly with modern C++ design, working with the STL and so on. I would like something like this:

        std::string msg;
        unistring umsg;
        umsg.from_utf8(msg);
        unistring::word_iterator wi;
        int n;
        for (wi = umsg.words().begin(), n = 0; wi != umsg.words().wi_end() && n < 10; ++wi, ++n)
            ;
        msg = umsg.substr(umsg.words().begin(), wi).to_utf8();
        cout << _("First 10 words are ") << msg;

    Does anybody know a good STL-friendly ICU wrapper released under an open source license, preferably a permissive one like MIT or Boost, though others that are LGPLv2 compatible are OK as well? Is there another high-quality library similar to ICU? Platform: UNIX/POSIX; Windows support is not required. Thanks, Artyom

    Edit: Unfortunately I wasn't logged in, so I can't mark an answer as accepted. I have attached the answer myself.

    Read the article
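
    Two of the requirements in the question above (case-insensitive ordering and first characters for an index) can be illustrated without ICU. What follows is only a minimal Python sketch, not an ICU wrapper: `sort_key` and `index_letters` are hypothetical helper names, and the ordering is plain code-point order after normalization and case folding, not locale-aware collation.

```python
import unicodedata


def sort_key(text: str) -> str:
    # Normalize first so composed and decomposed spellings compare equal,
    # then case-fold for a case-insensitive comparison.
    return unicodedata.normalize("NFC", text).casefold()


def index_letters(words):
    # Upper-cased first character of each word, in sorted order, no repeats.
    letters = []
    for word in sorted(words, key=sort_key):
        first = unicodedata.normalize("NFC", word)[0].upper()
        if first not in letters:
            letters.append(first)
    return letters


# "e\u0301clair" is the decomposed spelling of "éclair".
words = ["éclair", "Apple", "apple", "Zebra", "e\u0301clair"]
print(sorted(words, key=sort_key))
print(index_letters(words))
```

    In this sketch "éclair" sorts after "Zebra", because the comparison is by code point. Avoiding that is the kind of job a collation library such as ICU exists for.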

  • Regular expression of unicode characters on string

    - by Marcus King
    I'm working in C# doing some OCR work and have extracted the text I need to work with. Now I need to parse a line using regular expressions.

        string checkNum;
        string routingNum;
        string accountNum;

        Regex regEx = new Regex(@"\u9288\d+\u9288");
        Match match = regEx.Match(numbers);
        if (match.Success)
            checkNum = match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1);

        regEx = new Regex(@"\u9286\d{9}\u9286");
        match = regEx.Match(numbers);
        if (match.Success)
            routingNum = match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1);

        regEx = new Regex(@"\d{10}\u9288");
        match = regEx.Match(numbers);
        if (match.Success)
            accountNum = match.Value.Remove(match.Value.Length - 1, 1);

    The problem is that the string contains the necessary Unicode characters when I do a .ToCharArray() and inspect the contents of the string, but it never seems to recognize the Unicode characters when I parse the string looking for them. I thought strings in C# were Unicode by default.

    Read the article
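
    One likely cause in the question above: `\u` escapes take four hexadecimal digits, while 9288 and 9286 look like decimal code points (9288 is 0x2448 and 9286 is 0x2446, both in the Optical Character Recognition block). Below is a minimal Python sketch of the same three patterns, assuming that reading is right. `parse_micr`, the marker names, and the sample line are made up for illustration.

```python
import re

MARK_A = chr(9288)  # decimal 9288 == 0x2448, i.e. "\u2448"
MARK_B = chr(9286)  # decimal 9286 == 0x2446, i.e. "\u2446"


def parse_micr(line: str):
    # Capture only the digits, so no delimiter trimming is needed afterwards.
    check = re.search(f"{MARK_A}(\\d+){MARK_A}", line)
    routing = re.search(f"{MARK_B}(\\d{{9}}){MARK_B}", line)
    account = re.search(f"(\\d{{10}}){MARK_A}", line)
    return tuple(m.group(1) if m else None for m in (check, routing, account))


# Hypothetical line layout, only for demonstration.
sample = f"{MARK_A}001234{MARK_A} {MARK_B}123456789{MARK_B} 0987654321{MARK_A}"
print(parse_micr(sample))
print("\u9288" == MARK_A, "\u2448" == MARK_A)
```

    The last line shows the mismatch directly: the escape written with the decimal digits names a different character from the one the hexadecimal escape names.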

  • Problem using unicode in URLs with cgi.PATH_INFO in ColdFusion

    - by Loftx
    Hi there, My ColdFusion (MX7) site has search functionality which appends the search term to the URL, e.g. http://www.example.com/search.cfm/searchterm. The problem I'm running into is that this is a multilingual site, so the search term may be in another language, e.g. ???????, leading to a search URL such as http://www.example.com/search.cfm/???????. The problem arises when I come to retrieve the search term from the URL. I'm using cgi.PATH_INFO to retrieve the path of the search page and the search term, and extracting the search term from this, e.g. /search.cfm/searchterm. However, when Unicode characters are used in the search they are converted to question marks, e.g. /search.cfm/??????. These appear to be actual question marks, rather than the browser not being able to format Unicode characters, or them being mangled on output. I can't find any information about whether ColdFusion supports Unicode in the URL, or how I can go about resolving this and getting hold of the complete URL in some way. Does anyone have any ideas? Cheers, Tom

    Read the article

  • Why are there so many spaces and line breaks in Unicode?

    - by maaartinus
    Unicode has maybe 50 spaces [\u0009\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000] and 6 line breaks: not only CRLF, LF and CR, but also NEL (U+0085), PS (U+2029) and LS (U+2028). Maybe I could understand most of the spaces and PS ("Paragraph separator"), but what are "Next Line" and "Line separator" good for? It all looks as if it was invented by a very big committee where everybody wanted their own space and the leaders were granted one line break each. But seriously, how do you deal with it when your programming language doesn't support it (or does it wrong, as e.g. Java does)?

    Read the article
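
    As one way of dealing with the line breaks and spaces asked about above, here is a minimal Python sketch. `normalize_newlines` and `collapse_spaces` are hypothetical helper names; the first relies on `str.splitlines`, which treats NEL, LS and PS as line boundaries, and the second maps every character in general category Zs (space separator) to a plain U+0020.

```python
import unicodedata


def normalize_newlines(text: str) -> str:
    # str.splitlines() splits on CR, LF, CRLF, NEL (U+0085), LS (U+2028),
    # PS (U+2029) and a few legacy control characters.
    return "\n".join(text.splitlines())


def collapse_spaces(text: str) -> str:
    # Category "Zs" covers the space separators such as U+00A0 and U+3000.
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in text
    )


sample = "one\u0085two\u2028three\u2029four\r\nfive"
print(sample.splitlines())
print(repr(normalize_newlines(sample)))
print(repr(collapse_spaces("a\u00a0b\u3000c")))
```

    Tab and the line-break characters are not in category Zs, so `collapse_spaces` leaves them alone; whether that is the right policy depends on the application.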

  • Strange characters appearing on websites - ASCII? - UNICODE?

    - by Mick
    I have created many very simple pure HTML websites over the years. Most of them appear to work fine most of the time. But there is one recurring problem which I have never quite sorted out involving strange characters. The scenario goes like this: I create the site. I look at it in my browser, everything appears fine. I may look at it a great many times over the coming weeks or months as I make additions here and there. Perhaps on a variety of browsers on a variety of PC's. Then one day I look at the page and see a random sprinkling of white question marks against dark diamond shapes. These might appear where I had expected to see hyphens or quotes or apostrophes. My immediate thought is that my browser got into some strange state because I was looking at some foreign website with strange characters, but I'm never quite sure. I'm left with that nagging feeling that perhaps half the planet is seeing my website with funny question marks all over it. So my question is what's going on? What should I do to ensure that as many people as possible around the world can view my text as I originally intended? Should I be using those special html sequences like &pound; for all non alphanumeric characters? Should I worry at all? Edit: Right now I have the problem occurring on this page: http://www.fullreservebanking.com/papers.htm ... part of it looks like this: I am using FireFox 5 and the character encoding currently appears to be "UNICODE (UTF-8)". I do not remember manually setting the character encoding to anything since installation. I do occasionally look at Japanese websites for work related reasons - though when I do so, I do not manually make any changes to firefox settings. Edit: Now fixed. Web page altered accordingly.

    Read the article
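
    The white question mark on a dark diamond described above is U+FFFD, the replacement character that decoders emit for bytes that are not valid in the expected encoding. A common cause, though not necessarily the one on that page, is text saved as Windows-1252 but decoded as UTF-8. A minimal Python sketch under that assumption follows; `to_entities` is a hypothetical helper name.

```python
# Curly apostrophe and en dash as a Windows-1252 editor would save them.
raw = "It\u2019s 5 \u2013 10".encode("cp1252")

# Decoded as UTF-8, the bytes 0x92 and 0x96 are invalid and become U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")
print(as_utf8)

# Decoded with the encoding actually used, the text is intact.
as_cp1252 = raw.decode("cp1252")
print(as_cp1252)


def to_entities(text: str) -> str:
    # Alternative: emit numeric character references for non-ASCII text,
    # which survive whichever encoding the browser picks.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_entities("\u00a35 \u2013 10"))
```

    Declaring the encoding the file was really saved in is usually simpler than converting every character to an entity.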

  • Why is Django reverse() failing with unicode?

    - by JeffS
    Here is a Django models file that is not working as I would expect. I would expect the to_url method to do the reverse lookup in the urls.py file, and get a URL that would correspond to calling that view with arguments supplied by the Argument model.

        from django.db import models

        class Element(models.Model):
            viewname = models.CharField(max_length = 200)
            arguments = models.ManyToManyField('Argument', null = True, blank = True )

            @models.permalink
            def to_url(self):
                d = dict( self.arguments.values_list('key', 'value') )
                return (self.viewname, (), d)

        class Argument(models.Model):
            key = models.CharField(max_length=200)
            value = models.CharField(max_length=200)

    The value d ends up as a dictionary from a unicode string to another unicode string, which, I believe, should work fine with the reverse() method that would be called by the permalink decorator. However, it results in:

        TypeError: reverse() keywords must be strings

    Read the article

  • What DVCS support Unicode filenames?

    - by Craig McQueen
    I'm interested in trying out distributed version control systems. git sounds promising, but I saw a note somewhere for the Windows port of git that says "don't use non-ASCII filenames". I can't find that now, but there is this link. It's put me off git for now, but I don't know if the other options are any better. Support for non-ASCII filenames is essential for my Japanese company. I'm looking for one that internally stores filenames as Unicode, not a platform-dependent encoding which would cause endless grief. So: What DVCS support Unicode filenames? In both Windows and Linux? Ideally, with the possibility to transfer repositories between Windows and Linux machines with minimal issues?

    Read the article

  • html tag attribute displayed in unicode

    - by user297975
    I have the following code, from which you can see that I use the same way to create the text in UTF-8. The text shown between HTML tags is shown correctly, but the text shown as an HTML tag attribute is shown in Unicode. I'm positive that on the server side (PHP) both texts are treated in the same way and are encoded in UTF-8. Why is the text in the HTML tag attribute shown in Unicode? ?????????????????????? ??

    Read the article

  • Font choices in International scenarios: multilingual vs unicode

    - by TravisO
    I have a website that will eventually display multiple languages. I notice the common fonts used in web CSS (e.g. Arial, Verdana, Times New Roman, Tahoma) and even the newer Vista/Office 2007/VS2008 fonts (Calibri, Cambria, Candara, Corbel, etc.) are significantly larger (~350 KB) than your average (US only?) TTF font (~50 KB), so these fonts contain most or all of the major character sets that common languages (Spanish, French, German, etc.) use. My question is, would somebody confirm that the fonts listed above are acceptable for international use with the major (let's say top 8) spoken languages? If so, then I'm guessing the only purpose of Unicode fonts such as "Arial Unicode" (a massive 22 MB) is dealing with extremely niche dialects, eastern glyphs (Chinese, Japanese) and dead languages? I'm just looking for some confirmation from developers who have their desktop apps/web apps rendering multiple languages and have visual confirmation. I'm already in the 99% sure bin, but you know what they say about assumptions.

    Read the article

  • Track unicode words from Twitter using Ruby and the Tweetstream API

    - by Régis B.
    I am trying to track a set of keywords from Twitter by using the Streaming API (can't post the link here because of spam limitations: google twitter streaming API). I am doing this inside Ruby, using the TweetStream gem: http://bit.ly/cODAWI The problem I have is that I want to track keywords that contain some Unicode/UTF-8 characters. For instance:

        require 'rubygems'
        require 'tweetstream'

        TweetStream::Client.new("my_user_name", "my_password").track("é") do |s|
          puts s.text
        end

    (You can try it out, provided you installed the tweetstream and json gems.) This piece of code does not print anything, while replacing "é" with "e" outputs a bunch of tweets continuously. I did not find any reliable documentation about Unicode in Ruby, so I have no idea where the problem comes from. Thanks for your help!

    Read the article

  • Unicode data from NSData to NSString

    - by Jeff
    So if I have NSData from an HTTP request, then I do something like this: NSString *test = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding]; This will result in null if the data contains weird unicode data (title is from reddit): {"title":"click..¦¦me..and..then¦¦________ ¦¦check¦¦_.your...¦¦.__...¦¦____ ¦¦....¦¦¦¦¦¦¦¦¦¦¦¦¦¦....¦¦____ ¦¦¦¦¦¦....¦¦¦¦¦¦....¦¦¦¦¦¦____ ¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦____ ....¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦______ ........¦¦..._recently....¦¦________ ....¦¦....viewed....links....¦¦_____"}, How would I convert the data to a string? Ideally, it would best if the string wasn't null so I could parse it as JSON, but even a lossy conversion is fine with me in these cases. I'm not familiar with unicode (naive American I am), so any enlightenment about that would be a nice bonus :)

    Read the article
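
    The lossy conversion the question above asks for can be sketched in Python (not Objective-C): decode as UTF-8, replace whatever is invalid, then parse the JSON. The payload below is a made-up stand-in for the reddit data, on the assumption that it contains stray 0xA6 bytes that are not valid UTF-8.

```python
import json

# Hypothetical response body: 0xA6 on its own is not a valid UTF-8 sequence.
payload = b'{"title": "click..\xa6\xa6me"}'

try:
    payload.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False  # the strict decode fails, like the nil NSString

# Lossy decode: every invalid byte becomes U+FFFD, the rest is kept.
lossy = payload.decode("utf-8", errors="replace")
title = json.loads(lossy)["title"]
print(strict_ok, ascii(title))

# If the bytes are really Latin-1, decoding them as such loses nothing.
latin1_title = json.loads(payload.decode("latin-1"))["title"]
print(ascii(latin1_title))
```

    In Latin-1 the byte 0xA6 is the broken bar "¦", which matches the characters visible in the title quoted above, so a wrong encoding guess is at least a plausible explanation.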

  • Unicode and URI encoding, decoding and escaping in JavaScript

    - by apphacker
    If you look at this table here, it has a list of escape sequences for Unicode characters that don't actually work for me. For example, for "%96", which should be a –, I get an error when trying to decode:

        decodeURIComponent("%96");
        URIError: URI malformed

    If I attempt to encode "–" I actually get:

        encodeURIComponent("–");
        "%E2%80%93"

    I searched through the internet and I saw this page, which mentions using escape and unescape with decodeURIComponent and encodeURIComponent respectively. This doesn't seem to help, because %96 doesn't show up as "–" no matter what I try, and this of course wouldn't work:

        decodeURIComponent(escape("%96"));
        "%96"

    Not very helpful. How can I get "%96" to be a "–" with JavaScript (without hardcoding a map for every single possible Unicode character I may run into)?

    Read the article
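
    The two escapes in the question above come from different encodings: 0x96 is the en dash in Windows-1252, while E2 80 93 is the same character in UTF-8. The sketch below shows this in Python rather than JavaScript, using the standard `urllib.parse` functions; it illustrates the cause and is not a drop-in fix for the browser.

```python
from urllib.parse import quote, unquote

EN_DASH = "\u2013"

# UTF-8 percent-encoding, matching what encodeURIComponent produced.
utf8_escaped = quote(EN_DASH)
print(utf8_escaped)

# "%96" only decodes to an en dash when read as Windows-1252.
from_cp1252 = unquote("%96", encoding="cp1252")
print(ascii(from_cp1252))

# Read as UTF-8 (the default), the lone byte 0x96 is invalid.
from_utf8 = unquote("%96")
print(ascii(from_utf8))
```

    So the table in question most likely lists Windows-1252 byte values, and decoding them needs a Windows-1252 mapping rather than a UTF-8 decoder.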

  • Regular expressions in python unicode

    - by Remy
    I need to remove all the HTML tags from the data of a given webpage. I tried this using regular expressions:

        import urllib2
        import re
        page = urllib2.urlopen("http://www.frugalrules.com")
        from bs4 import BeautifulSoup, NavigableString, Comment
        soup = BeautifulSoup(page)
        link = soup.find('link', type='application/rss+xml')
        print link['href']
        rss = urllib2.urlopen(link['href']).read()
        souprss = BeautifulSoup(rss)
        description_tag = souprss.find_all('description')
        content_tag = souprss.find_all('content:encoded')
        print re.sub('<[^>]*>', '', content_tag)

    But the syntax of re.sub is:

        re.sub(pattern, repl, string, count=0)

    So I modified the code as follows (instead of the print statement above):

        for row in content_tag:
            print re.sub(ur"<[^>]*>",'',row,re.UNICODE)

    But it gives the following error:

        Traceback (most recent call last):
          File "C:\beautifulsoup4-4.3.2\collocation.py", line 20, in <module>
            print re.sub(ur"<[^>]*>",'',row,re.UNICODE)
          File "C:\Python27\lib\re.py", line 151, in sub
            return _compile(pattern, flags).sub(repl, string, count)
        TypeError: expected string or buffer

    What am I doing wrong?

    Read the article
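
    Two things stand out in the code above: the fourth positional argument of `re.sub` is `count`, so `re.UNICODE` is being passed as a count rather than as a flag, and each `row` is a BeautifulSoup element rather than a string. The minimal sketch below is written for Python 3 (the question uses Python 2) and leaves out the network and bs4 parts; `strip_tags` is a hypothetical helper name.

```python
import re

# Flags belong to compile() or to the flags= keyword, never the 4th position.
TAG = re.compile(r"<[^>]*>", flags=re.UNICODE)


def strip_tags(fragment) -> str:
    # str() turns a non-string object (for example a parsed element)
    # into markup text before the regex sees it.
    return TAG.sub("", str(fragment))


sample = "<p>caf\u00e9 <b>au</b> lait</p>"
print(strip_tags(sample))
```

    A regex like this is only a rough tag stripper. Since the page is already parsed, asking the parser for the element's text is usually the more robust route.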

  • Output Unicode to Console Using C++

    - by Jesse Foley
    I'm still learning C++, so bear with me and my sloppy code. The compiler I use is Dev C++. I want to be able to output Unicode characters to the console using cout. Whenever I try things like:

        #include <iostream>
        #include <cstdlib>
        using namespace std;

        int main()
        {
            cout << "Hello World!\n";
            cout << "Blah blah blah some gibberish unicode: ÐAßGg\n";
            system("PAUSE");
            return 0;
        }

    It outputs strange characters to the console, like µA¦Gg. Why does it do that, and how can I get it to display ÐAßGg? Or is this not possible with Windows?

    Read the article

  • What .NET UnmanagedType is Unicode (UTF-16)?

    - by Pat
    I am packing bytes into a struct, and some of them correspond to a Unicode string. The following works fine for an ASCII string:

        [StructLayout(LayoutKind.Sequential)]
        private struct PacketBytes
        {
            [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 64)]
            public string MyString;
        }

    I assumed that I could do

        [StructLayout(LayoutKind.Sequential)]
        private struct PacketBytes
        {
            [MarshalAs(UnmanagedType.LPWStr, SizeConst = 32)]
            public string MyString;
        }

    to make it Unicode, but that didn't work. (Since this field is part of a struct with other fields, which I've omitted for clarity, I can't simply change the CharSet of the containing struct.) Any idea what I'm doing wrong?

    Read the article
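
    The layout the question above describes, 32 UTF-16 code units stored inline in a 64-byte field, can be sketched outside .NET. This Python sketch only illustrates the byte layout and says nothing about which `UnmanagedType` to use. `pack_field`, `unpack_field`, and the decision to reserve room for a terminating NUL are assumptions.

```python
import struct

FIELD_BYTES = 64  # 32 UTF-16 code units, as in the question


def pack_field(text: str) -> bytes:
    data = text.encode("utf-16-le")
    if len(data) > FIELD_BYTES - 2:  # assumption: keep room for a NUL
        raise ValueError("text does not fit in the field")
    # The "s" format pads the value with NUL bytes up to the field size.
    return struct.pack(f"{FIELD_BYTES}s", data)


def unpack_field(buf: bytes) -> str:
    (raw,) = struct.unpack(f"{FIELD_BYTES}s", buf)
    # Decode the whole field, then cut at the first NUL character.
    return raw.decode("utf-16-le").split("\x00", 1)[0]


packed = pack_field("Gr\u00fc\u00dfe")
print(len(packed), unpack_field(packed))
```

    The point of the sketch is the size arithmetic: a field that holds 64 single-byte characters holds only 32 two-byte UTF-16 code units.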

  • IE cannot download file with unicode pathname

    - by MM
    I have a web-app that allows users to upload and download image files by pressing buttons on a web page. A user of this page is reporting that IE 7 and 8 fail to download files when the files have Unicode pathnames. IE prompts the user with a dialog stating: "Internet explorer cannot download (file) at (webserver).". Unfortunately I have not been able to reproduce the problem using these versions on my machine. My question is, what could cause this, and how can I prevent it from happening? I have read about problems with cache control (I currently have it set to no-cache); however, I am not using HTTP-S, and the problem only occurs with file-names containing Unicode characters.

    Read the article

  • How to convert unicode character to its escaped ascii equivalent in c#

    - by Grant
    Hi, I am beginning with a string containing an encoded Unicode character, "&#xfc;". I pass the string to an object that performs some logic and returns another string. That string has the original encoded character converted to its Unicode equivalent, "ü". I need to get the original encoded character back, but so far I have not been able to. I have tried using the HttpUtility.HtmlEncode() method, but that returns "&#252;", which is not the same. Can anyone help?

    Read the article
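
    The two forms in the question above are the same character reference written in hexadecimal (xfc) and in decimal (252). Producing the hexadecimal form is a matter of formatting the code point, as this minimal Python sketch shows (the question itself is about C#); `to_hex_entities` is a hypothetical helper name.

```python
import html


def to_hex_entities(text: str) -> str:
    # Leave ASCII alone; write everything else as a hexadecimal reference.
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):x};" for ch in text
    )


encoded = to_hex_entities("f\u00fcr")
print(encoded)

# Both spellings decode to the same character.
print(html.unescape("&#xfc;") == html.unescape("&#252;"))
```

    Characters with special meaning in markup, such as `&` and `<`, are not escaped by this helper and would still need the usual HTML escaping.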

  • c# unicode string output

    - by Reg
    I have a function to convert a string to a Unicode string:

        private string UnicodeString(string text)
        {
            return Encoding.UTF8.GetString(Encoding.ASCII.GetBytes(text));
        }

    But when I call this function the output is wrong. It looks like my function is not working. Console.WriteLine(UnicodeString("????? ?????")) prints just question marks on the console, like this: ????? ???? Is there any way to tell the console to display it correctly?

    UPDATE: It looks like the problem is not in Unicode. I think it may be displaying question marks because I do not have the correct locale in the system (Windows 7). Is there any way to make it work without changing the locale?

    Read the article
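
    Whatever the console does, the function in the question above already loses data: encoding to ASCII replaces every non-ASCII character with "?", and decoding those bytes as UTF-8 cannot bring the characters back. A minimal Python sketch of the same round trip follows; the Russian sample string is an assumption, since the original text was not preserved.

```python
text = "\u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440"  # hypothetical sample

# ASCII cannot represent these letters, so each one becomes "?".
ascii_bytes = text.encode("ascii", errors="replace")
round_trip = ascii_bytes.decode("utf-8")
print(round_trip)

# Encoding and decoding with the same Unicode encoding is lossless.
utf8_round_trip = text.encode("utf-8").decode("utf-8")
print(utf8_round_trip == text)
```

    Strings in .NET, as in Python 3, are already Unicode, so no conversion function is needed before writing them out; only the console's output encoding and font remain as possible culprits.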

  • The Road to Professional Database Development: Database Normalization

    Not only is the process of normalization valuable for increasing data quality and simplifying the process of modifying data, but it actually makes the database perform much faster. To prove the point, Peter Larsson takes a large unnormalised database and subjects it to successive stages of normalisation.

    Read the article

  • Python - pyparsing unicode characters

    - by mgj
    Hi..:) I tried using w = Word(printables), but it isn't working. How should I give the spec for this? 'w' is meant to process Hindi characters (UTF-8). The code specifies the grammar and parses accordingly.

        671.assess :: ????? ::2
        x=number + "." + src + "::" + w + "::" + number + "." + number

    If there are only English characters it works, so the code is correct for ASCII input, but it does not work for Unicode input. I mean that the code works when we have something of the form 671.assess :: ahsaas ::2, i.e. it parses words in English, but I am not sure how to parse and then print characters in Unicode. I need this for English-Hindi word alignment. The Python code looks like this:

        # -*- coding: utf-8 -*-
        from pyparsing import Literal, Word, Optional, nums, alphas, ZeroOrMore, printables, Group, alphas8bit

        # grammar
        src = Word(printables)
        trans = Word(printables)
        number = Word(nums)
        x = number + "." + src + "::" + trans + "::" + number + "." + number

        # parsing for eng-dict
        efiledata = open('b1aop_or_not_word.txt').read()
        eresults = x.parseString(efiledata)
        edict1 = {}
        edict2 = {}
        counter = 0
        xx = list()
        for result in eresults:
            trans = ""  # translation string
            ew = ""  # english word
            xx = result[0]
            ew = xx[2]
            trans = xx[4]
            edict1 = { ew:trans }
            edict2.update(edict1)
        print len(edict2)  # no of entries in the english dictionary
        print "edict2 has been created"
        print "english dictionary" , edict2

        # parsing for hin-dict
        hfiledata = open('b1aop_or_not_word.txt').read()
        hresults = x.scanString(hfiledata)
        hdict1 = {}
        hdict2 = {}
        counter = 0
        for result in hresults:
            trans = ""  # translation string
            hw = ""  # hin word
            xx = result[0]
            hw = xx[2]
            trans = xx[4]
            #print trans
            hdict1 = { trans:hw }
            hdict2.update(hdict1)
        print len(hdict2)  # no of entries in the hindi dictionary
        print "hdict2 has been created"
        print "hindi dictionary" , hdict2

        '''
        #######################################################################
        def translate(d, ow, hinlist):
            if ow in d.keys():  # ow=old word d=dict
                print ow , "exists in the dictionary keys"
                transes = d[ow]
                transes = transes.split()
                print "possible transes for" , ow , " = ", transes
                for word in transes:
                    if word in hinlist:
                        print "trans for" , ow , " = ", word
                        return word
                return None
            else:
                print ow , "absent"
                return None

        f = open('bidir','w')
        #lines = ["'\
        #5# 10 # and better performance in business in turn benefits consumers . # 0 0 0 0 0 0 0 0 0 0 \
        #5# 11 # vHyaapaar mEmn bEhtr kaam upbhOkHtaaomn kE lIe laabhpHrdd hOtaa hAI . # 0 0 0 0 0 0 0 0 0 0 0 \
        #'"]
        data = open('bi_full_2','rb').read()
        lines = data.split('!@#$%')
        loc = 0
        for line in lines:
            eng, hin = [subline.split(' # ') for subline in line.strip('\n').split('\n')]
            for transdict, source, dest in [(edict2, eng, hin), (hdict2, hin, eng)]:
                sourcethings = source[2].split()
                for word in source[1].split():
                    tl = dest[1].split()
                    otherword = translate(transdict, word, tl)
                    loc = source[1].split().index(word)
                    if otherword is not None:
                        otherword = otherword.strip()
                        print word, ' <-> ', otherword, 'meaning=good'
                        if otherword in dest[1].split():
                            print word, ' <-> ', otherword, 'trans=good'
                            sourcethings[loc] = str( dest[1].split().index(otherword) + 1)
                source[2] = ' '.join(sourcethings)
            eng = ' # '.join(eng)
            hin = ' # '.join(hin)
            f.write(eng+'\n'+hin+'\n\n\n')
        f.close()
        '''

    If an example input for the source file is:

        1# 5 # modern markets : confident consumers # 0 0 0 0 0
        1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 0 0 0 0 0 0
        !@#$%

    the output would look like this:

        1# 5 # modern markets : confident consumers # 1 2 3 4 5
        1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 1 2 3 4 5 0
        !@#$%

    Output explanation: this achieves bidirectional alignment. It means the first word of the English sentence, 'modern', maps to the first word of the Hindi sentence, 'AddhUnIk', and vice versa. Here even punctuation characters are taken as words, as they are also an integral part of the bidirectional mapping. Thus, if you look closely, the Hindi word '.' has a null alignment and maps to nothing in the English sentence, as the latter doesn't have a full stop. The third line in the output is a delimiter used when working with a number of sentences for which you are trying to achieve bidirectional mapping. What modification should I make for it to work if I have the Hindi sentences in Unicode (UTF-8) format?

    Read the article
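
    The dictionary lines in the question above can also be parsed without pyparsing. The sketch below swaps in a regular expression (it is not a pyparsing grammar) and is written for Python 3, where strings are Unicode and `\S` matches Devanagari text. `parse_entries` is a hypothetical helper name and the Hindi sample words are made up for illustration, since the original ones were not preserved.

```python
import re

# number "." source-word "::" translation "::" number
ENTRY = re.compile(r"(\d+)\.(\S+)\s*::\s*(\S+)\s*::\s*(\d+)")


def parse_entries(text: str) -> dict:
    # Map each source word to its translation.
    return {src: trans for _num, src, trans, _n in ENTRY.findall(text)}


# Hypothetical sample lines in the format shown in the question.
sample = (
    "671.assess :: \u090f\u0939\u0938\u093e\u0938 ::2\n"
    "672.modern :: \u0906\u0927\u0941\u0928\u093f\u0915 ::1\n"
)
entries = parse_entries(sample)
print(len(entries))
```

    When the input comes from a file, opening it with an explicit `encoding="utf-8"` gives the parser text rather than raw bytes, which is the step the Python 2 code in the question leaves out.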

  • problem using getline with a unicode file

    - by hamishmcn
    UPDATE: Thank you to @Potatoswatter and @Jonathan Leffler for comments. Rather embarrassingly, I was caught out by the debugger tool tip not showing the value of a wstring correctly. However, it still isn't quite working for me, and I have updated the question below.

    If I have a small multibyte file that I want to read into a string, I use the following trick: I use getline with a delimiter of '\0', e.g.

        std::string contents_utf8;
        std::ifstream inf1("utf8.txt");
        getline(inf1, contents_utf8, '\0');

    This reads in the entire file, including newlines. However, if I try to do the same thing with a wide character file it doesn't work; my wstring only reads up to the first line.

        std::wstring contents_wide;
        std::wifstream inf2(L"ucs2-be.txt");
        getline( inf2, contents_wide, wchar_t(0) ); //doesn't work

    For example, if my Unicode file contains the chars A and B separated by CRLF, the hex looks like this: FE FF 00 41 00 0D 00 0A 00 42. Based on the fact that, with a multibyte file, getline with '\0' reads the entire file, I believed that getline( inf2, contents_wide, wchar_t(0) ) should read in the entire Unicode file. However, it doesn't: with the example above my wide string would contain the following two wchar_ts: FF FF. (If I remove the wchar_t(0), it reads in the first line as expected, i.e. FE FF 00 41 00 0D 00.) Why doesn't wchar_t(0) work as a delimiting wchar_t of "00 00"? Thank you

    Read the article
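
    A plausible explanation for the behaviour above is that the wide stream, by default, widens the file byte by byte instead of decoding UTF-16, so the first 00 byte already looks like the delimiter. The Python sketch below only illustrates that reading using the exact bytes from the question; it is not C++ and makes no claim about any particular standard library.

```python
# The bytes quoted in the question: BOM, "A", CR, LF, "B" in UTF-16 BE.
raw = bytes.fromhex("FEFF0041000D000A0042")

# Decoded as UTF-16, the BOM selects big-endian and is consumed.
text = raw.decode("utf-16")
print(repr(text))

# Read byte by byte, the first zero byte sits right after the BOM.
first_zero = raw.index(0)
print(first_zero, raw[:first_zero].hex())
```

    Under that reading, only the two BOM bytes come before the delimiter, which is consistent with the wide string ending up with two elements.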

  • Allowed unicode characters in IDN host labels

    - by Roland Franssen
    Hi all, I'm currently working on a "proper" URI validator, and currently it all comes down to hostname validation; the rest isn't that tricky. I'm stuck at IDN hostname labels (i.e. labels containing Unicode; possible punycode-encoded strings have been decoded at this point). My first idea was basically one regex for TLDs not supporting IDN and one for those that do (http://www.mozilla.org/projects/security/tld-idn-policy-list.html (?)), respectively ^[a-zA-Z0-9-]+$ and ^[a-zA-Z0-9-\p{L}]+$. However, this is not an ideal situation, since every IDN registrar can decide which characters to allow and which not. What I'm looking for is a proper, consistent, up-to-date data table of the Unicode characters allowed in the various TLDs; I'm getting the idea that I have to find all the data myself at Russian and Chinese registry sites (which is quite difficult). So before scouring the web, I wondered: is there such a list? Or are there better approaches, best/common practices, etc.? (I want the validation to be as strict as possible.) Any help is welcome! // Roland

    Read the article
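
    A baseline for the validation discussed above is to run the hostname through an IDNA conversion and reject whatever fails. This minimal Python sketch uses the standard library's `idna` codec, which implements the older IDNA 2003 rules and knows nothing about per-TLD registry policies; `to_ascii_hostname` is a hypothetical helper name.

```python
def to_ascii_hostname(host: str):
    # Returns the ASCII (punycode) form, or None if the conversion fails,
    # for example because a label is empty or longer than 63 characters.
    try:
        return host.encode("idna").decode("ascii")
    except UnicodeError:
        return None


ace = to_ascii_hostname("b\u00fccher.example")
print(ace)

# Decoding goes the other way, from the ASCII form back to Unicode.
print(ace.encode("ascii").decode("idna") == "b\u00fccher.example")

print(to_ascii_hostname("a" * 64 + ".example"))
```

    This checks only the general protocol rules. The per-registry character tables the question asks about would still have to be applied on top of it.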

  • iPhone app rejection for using ICU (Unicode extensions)

    - by nickbit
    I received the following mail from Apple concerning my application:

    *Thank you for submitting your update to ??µ??es?a to the App Store. During our review of your application we found it is using private APIs, which is in violation of the iPhone Developer Program License Agreement section 3.3.1; "3.3.1 Applications may only use Documented APIs in the manner prescribed by Apple and must not use or call any private APIs." While your application has not been rejected, it would be appropriate to resolve this issue in your next update. The following non-public APIs are included in your application: u_isspace, ubrk_close, ubrk_current, ubrk_first, ubrk_next, ubrk_open. If you have defined methods in your source code with the same names as the above mentioned APIs, we suggest altering your method names so that they no longer collide with Apple's private APIs to avoid your application being flagged with future submissions. Please resolve this issue in your next update to ??µ??es?a. Sincerely, iPhone App Review Team*

    The functions mentioned in this mail are used in the ICU library (International Components for Unicode). Although my app is not rejected at this point, I don't feel very secure about its future, because it relies heavily on Unicode support and on these components in particular. Another thing is that I do not call these functions directly; they are called by a custom 'sqlite' build (with FTS3 extensions enabled). Am I missing something here? Any suggestions?

    Read the article
