05988 | Bug reports | Conditions | public | 2012-08-04 14:51
Reporter: Lise94 | Assigned To: c_schmitz
Status: closed | Resolution: fixed
Product Version: 1.92+
Fixed in Version: 1.92+
Problem displaying words with French accent within Expression Manager


I want to use Expression Manager to display a question title that depends on a previous answer. That works without problems,

but when the expression contains an accented character, the question is not displayed properly and the respondent sees &eacute; instead of é.

This problem only appears on the respondent screen, not in the admin part.

See the attached file for an example.

Complete LimeSurvey version number (& build): 120330
I will donate to the project if issue is resolved: No
Database type & version: mysql
Server OS (if known): Windows
Webserver software & version (if known): ?
PHP Version: 5

2012-04-05 19:58

reporter   ~18236

the respondent sees & e a c u t e ; instead of é
(I have to put spaces between the letters; otherwise you can't see the "word" that is really displayed)



2012-04-07 16:32

reporter   ~18247

c_schmitz - the value in the database has the "wrong" value - it contains "Pourquoi {if((boisson.code=='Y'),'aimez-vous le th&eacute;','non')} ?"

EM uses htmlspecialchars_decode() when processing database values. Should it be using html_entity_decode(), or should the database be storing non-entity-encoded values?
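To illustrate the difference behind that question, here is a minimal JavaScript sketch. The helper `decodeWith` and both lookup tables are hypothetical and only cover a handful of entities; they are not the functions shipped in em_javascript.js. The point is only that a decoder limited to the HTML special characters leaves a named entity such as &eacute; untouched, while a full entity decoder turns it back into é.

```javascript
// Hypothetical lookup tables, for illustration only.
// SPECIAL covers just the characters handled by a "special chars" decoder.
const SPECIAL = { '&lt;': '<', '&gt;': '>', '&quot;': '"', '&#039;': "'", '&amp;': '&' };
// NAMED adds a small subset of named entities on top of SPECIAL.
const NAMED = { ...SPECIAL, '&eacute;': 'é', '&egrave;': 'è', '&pound;': '£' };

// Decode in a single pass, so that "&amp;lt;" becomes "&lt;" and not "<".
function decodeWith(table, text) {
  return text.replace(/&(?:#039|[a-z]+);/g, (m) => (m in table ? table[m] : m));
}

const stored = 'aimez-vous le th&eacute; &amp; le caf&eacute;';
const specialOnly = decodeWith(SPECIAL, stored); // named entities survive
const allEntities = decodeWith(NAMED, stored);   // named entities are decoded

console.log(specialOnly);
console.log(allEntities);
```

With the special-characters table, the stored text still contains &eacute; after decoding, which matches what the respondent sees in this report.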



2012-04-07 17:08

administrator   ~18250

The database should always store the pure UTF-8 encoded non-entity-encoded value.



2012-04-07 17:09

reporter   ~18251

OK, so EM is working properly, and somehow the wrong value is being inserted into the database.



2012-04-07 17:48

developer   ~18253

Maybe it is something with the database.

I tried importing the .lss file, then viewing and saving the question text:

  • With no HTML editor, corrected &eacute; to é: works fine.
  • With the inline HTML editor activated: works fine, the db has é, not &eacute;.
  • With the popup HTML editor: opened the HTML editor, didn't correct anything (lots of '), saved: OK in the db.

Then I tested with:
Import the .lss file and test directly: not OK.
In the db:
aimez-vous le th&eacute;?
Pourquoi {if((boisson.code=='Y'),'aimez-vous le th&eacute;','non')} ?
LEM changes & to &amp;

Does LEM have to accept both UTF-8 encoding and HTML entity encoding? If so, it should not replace & with &amp;.
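The double encoding described in this note can be sketched as follows. `escapeHtml` is a hypothetical helper, loosely modelled on the `double_encode` flag of PHP's htmlspecialchars(); it is not the EM implementation. It assumes that any "&" already followed by a well-formed entity name or numeric reference and a semicolon should be left alone when double encoding is switched off.

```javascript
// Hypothetical encoder, for illustration only.
function escapeHtml(text, doubleEncode = true) {
  // With doubleEncode === false, an "&" that already starts an entity is kept as is.
  const amp = doubleEncode
    ? /&/g
    : /&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)/g;
  return text
    .replace(amp, '&amp;') // must run first so later replacements are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#039;');
}

const fromDb = 'aimez-vous le th&eacute;';
const doubled = escapeHtml(fromDb);          // "&" becomes "&amp;", browser shows the literal entity
const preserved = escapeHtml(fromDb, false); // existing entity is preserved

console.log(doubled);
console.log(preserved);
```

Skipping existing entities avoids the visible "&eacute;", but it also means two different inputs can produce the same output, so it is a trade-off rather than a clear fix.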



2012-04-07 18:05

developer   ~18254

The JavaScript is good:

LEMif(LEManyNA('boisson.code'),'',(LEMif((LEMval('boisson.code') == 'Y'), 'aimez-vous le th&eacute;', 'non')))); (without &amp;)

It's a JavaScript issue, see some example:



2012-04-07 18:38

administrator   ~18257

Last edited: 2012-04-07 18:41

TMSWhite: I am sorry, when I wrote 'should' I did not mean it in an absolute way. As long as it is valid HTML, entity-encoded text is also allowed, though non-encoded is preferred.



2012-04-07 18:39

reporter   ~18258

One core question is how to properly protect EM from cross-site scripting attacks.

I used htmlspecialchars, which takes care of '>','<','"',"'", and '&'. Sounds like we may not need to escape '&'.

However the main issue is that the database is getting the wrong value stored. If you fix the database contents, you'll see the following generated:

LEMif(LEManyNA('boisson.code'),'',(LEMif((LEMval('boisson.code') == 'Y'), 'aimez-vous le thé', 'non'))));



2012-04-07 18:48

reporter   ~18260


We could have EM process all content through html_entity_decode(), but that is potentially risky. It will work if the original is encoded through htmlentities(), but if the original is only processed through htmlspecialchars(), we might get the wrong result.

It would be nice to ensure we're consistent in the database and always encode/decode with the same function.



2012-04-07 19:08

administrator   ~18261

I don't understand why we might get the wrong result?

AFAIK html_entity_decode() does not care whether the encoding was done with htmlspecialchars() or htmlentities(). If we assume that valid HTML is always used, it should be fine.



2012-04-07 19:14

developer   ~18262

Last edited: 2012-04-07 19:18

Carsten: it's JavaScript functionality.

If you write:
<div id="myelement"></div>
<script>
</script>

You get:
<div id="myelement">&amp;</div>
and not
<div id="myelement">&</div>

EDIT: maybe try with



2012-04-07 19:30

reporter   ~18263

Denis - if so, where in the source code is the problem? Somewhere before insertion into the database, I presume.


You are right - I thought there might be a way to do injection attacks by first encoding with htmlspecialchars() and then decoding with html_entity_decode(), but I can't create a working test case of that.



2012-04-07 21:41

reporter   ~18266

Last edited: 2012-04-07 21:48


I just tried a dozen permutations of changing some or all of the htmlspecialchars_decode() within EM to html_entity_decode(). All made it worse.

I also remember spending nearly 20 hours last year trying to get the right balance of specialchars and entity encode and decode.

I think we should insist that the database be htmlspecialchars() encoded and not htmlentities() encoded. Trying to mix and match the two types of encoding is a big mess.

So, the core fix would be to ensure that entity-encoded data doesn't get into the database.



2012-04-08 12:06

developer   ~18278

I tried to get HTML entities into the db; the only way was to import the .lss file and not change the question text (not save it).

The only way to fix it seems to be in em_javascript.js:

/* Display number with comma as radix separator, if needed
 */
function LEMfixnum(value)
{
    var newval = String(value);
    if (parseFloat(newval) != value) {
--      return htmlspecialchars(value); // unchanged
++      return value; // really unchanged
    }
    if (LEMradix===',') {
        newval = newval.split('.').join(',');
        return newval;
    }
    return value;
}



2012-04-09 15:10

reporter   ~18284

I think we need a group decision on this since it affects how we handle cross-site scripting protection.

Things we want to do automatically:
(1) Ensure that if a person enters a script into a text entry box, they can't use EM to substitute that script elsewhere (so anything entered in a text entry box should be processed through htmlspecialchars())

Things I'm not sure about:
(1) If people enter markup (like a script) into an equation variable, should it be possible to insert that markup into something else via EM? Perhaps, since the XSS protections tend to protect against accidental creation of markup or scripts via the editor
(2) Similarly, should people be able to write {if(true,'
','')} and have EM insert markup via normal substitution?
(3) If someone enters an answer that contains markup, should the database store the markup as entered, or should it first be processed through htmlspecialchars()?



2012-04-09 15:43

administrator   ~18285

Last edited: 2012-04-09 15:45

1) [...]be processed through htmlspecialchars() on output, yes

1.) Think so, yes. It should obey the XSS global filter setting in the admin, though. So if someone tries to save an equation like that and the corresponding filter setting is activated, it should be filtered accordingly.

2.) People? Who is that?

3.) Store as entered. Consider raw data always unsafe to display, but safe to store (assuming that SQL-injection-safe storing methods are used).



2012-04-09 15:48

reporter   ~18286

For 1 & 2, people = survey authors.



2012-04-09 16:12

administrator   ~18289

Last edited: 2012-04-09 16:12

Then 2.) Yes, they should be able to do that.



2012-04-09 16:41

reporter   ~18292

OK. That isn't how EM currently handles this, so I (or someone else) will need to look through all of the existing calls to htmlspecialchars() within EM and make the needed adjustments.

In general, seems like:
(1) static strings within expressions should be output without processing through htmlspecialchars()
(2) string content outside of expressions (e.g. text of questions and answers) should be output without processing through htmlspecialchars()
(3) variable values from free-text fields (e.g. a subset of .code, .value, and .shown, including "comment" and "other" fields) should be processed through htmlspecialchars().
(4) tool tips should use htmlentities() without double encoding.
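A minimal sketch of rules (2) and (3), assuming a simplified placeholder syntax: `substitute`, `escapeHtml`, and the `{name}` placeholders are hypothetical stand-ins and do not reflect how EM parses expressions. Author-supplied question text is emitted as is, while values that came from a respondent's free-text answer are escaped before they are inserted.

```javascript
// Hypothetical escaping helper, for illustration only.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#039;');
}

// Replace {name} placeholders with escaped participant values.
// The surrounding author text, including its markup, is left untouched.
function substitute(authorText, participantValues) {
  return authorText.replace(/\{(\w+)\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(participantValues, name)
      ? escapeHtml(String(participantValues[name]))
      : match
  );
}

const questionText = 'Pourquoi aimez-vous <em>le thé</em>, {prenom} ?';
const rendered = substitute(questionText, { prenom: '<script>alert(1)</script>' });

console.log(rendered);
```

The author's <em> markup and the raw é pass through unchanged, while the script typed by the respondent is rendered as harmless text.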



2012-04-09 17:08

administrator   ~18295

That proposal sounds fine to me.



2012-04-11 19:13

reporter   ~18321

attached sample survey for testing special characters. It is possible it won't import correctly. It uses the following text repeatedly:

This <question> has "special chars" including '&'; foreign chars like ßüöäÜÖħéèàçù£, and entities like < > é " ' £ é ò ô

However, that text gets converted at many points, especially during editing. So, consistent handling of entities may be more pervasive than just EM.



2012-04-11 19:15

reporter   ~18322

Last edited: 2012-04-11 19:15

Hmm, even Mantis does conversion - the section after "and entities like" spells out the actual entities:

Here they are with the ampersand replaced by a tilde so that they don't get converted:

and entities like ~lt; ~gt; ~eacute; ~quot; ~apos; ~pound; ~eacute; ~ograve; ~ocirc;



2012-04-16 09:37

administrator   ~18361

Can you point out where it is still going 'wrong' beside EM?



2012-04-16 17:21

reporter   ~18379

The short answer is that the problem appears to be isolated to EM, but looking at the code, I'm worried it may be more pervasive (which might explain why it took so long to get it "right" in EM in the first place).

When I do a regular-expression search for /html(specialchars|_entity|entities)/ in the 1.92 codebase (searching all .php and .js source), I get 542 matches across 78 files.
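For reference, the same pattern can be applied to a string of source code like this. The sample line and the `countMatches` helper are made up for illustration; this is not the tool that produced the 542 figure.

```javascript
// The pattern quoted above, with the global flag so every occurrence is counted.
const pattern = /html(specialchars|_entity|entities)/g;

// Count occurrences of the pattern in one piece of source text.
function countMatches(source) {
  return (source.match(pattern) || []).length;
}

// Hypothetical sample line, not taken from the LimeSurvey codebase.
const sample = 'htmlspecialchars_decode($a); html_entity_decode($b); htmlentities($c);';
const total = countMatches(sample);

console.log(total);
```

Note that the pattern matches both the encoding and the decoding variants of each function, so the match count alone does not say how many calls need changing.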

Among them, we have several versions of these JavaScript functions:
(1) em_javascript.js and admin_core.js and browse.js have incompatible versions of htmlspecialchars()
(2) em_javascript.js and translation.js have incompatible versions of html_entity_decode()

We should probably move the functions out of em_javascript.js and maintain them separately. At present, I didn't want to touch this since I don't know whether standardizing those functions would break anything.

Also, I'm not clear on some basics, such as, "how should strings be natively stored within":
(1) database
(2) $_SESSION
(3) $fieldmap
(4) other?

And what is the proper way to map from one to the other?
(1) database<=>$_SESSION for values
(2) database<=>question "object" for question text, help, relevance, etc. (e.g. database=>$fieldmap)
(3) question object=>HTML nodes
(4) question object=>HTML attributes
(5) $_SESSION values=>HTML input elements

Furthermore, if we do encoding, should we be doing double encoding and double decoding, or not? For example, "&" => "&amp;"?

My recommendation would be to have someone create succinct coding guidance documentation to indicate the proper way to manage special characters and entities within LS. That way we can (perhaps slowly) look through those 542 matches and make sure they are all correct and consistent.



2012-04-17 09:45

administrator   ~18402

Last edited: 2012-04-17 09:45

As mentioned earlier:

Strings should always be stored raw, be it in database, fieldmap or session.

On display:
Strings entered by survey participants should always be htmlspecialchars-encoded (as they are considered unsafe)
Strings entered by survey admin in the administration should not be encoded and displayed as they are (as they are considered to be safe or XSS-filtered)



2012-04-20 09:24

administrator   ~18437

Fix committed to master branch:



2012-04-20 09:26

administrator   ~18438

Fix committed to Yii branch:



2012-05-01 11:56

administrator   ~18519

New 1.92+ build released

