initial analysis of modify string, authored by Samuel Melm
......@@ -5,38 +5,85 @@ Modify string includes all scenarios that are applied to a string column and tra
Trifacta lists the following sub-scenarios:
- Basic String Operations
  - Prepend/Append to string ("foo" + "bar" -> "foobar")
  - Substring (e.g. "Hello World"[1,4] -> "ello")
  - Replace substring with other string
  - Others that don't qualify as "modify string" (length, isSubstring, indexOf...)
- Clean Strings
  - Define and enforce formats (e.g. URL/zip code) and replace wrong values with defaults
  - Remove from string:
    - leading/trailing whitespace (including tabs), aka trim
    - first/last word(s)
    - all whitespace in string
    - double spaces (replace by single spaces)
    - symbols (everything but letters and numbers), which especially includes punctuation
    - special letters (e.g. transform ä to a)
    - a specific substring (all occurrences or only the first)
    - quotes around the string
- Standardize String Values
  - Standardize case: toLower, toUpper
  - Break out CamelCase ("fooBar" -> "foo bar")
- Standardize String Lengths
  - Pad string values (e.g. leading/trailing spaces)
  - Fix the length of strings (e.g. remove all characters after the 8th)
Excluded from this analysis:
- Convert columns to one string column (aka printf), because it does not modify strings.
## Scenario and user-visible elements
- Fill out
- The user has a table with one or more string columns loaded into the editor.
- The user selects a column and applies one or more of the following operations to it.
- Let us assume the column is called "text" with rows ["foo", "bar"]
### Basic String Operations
#### Prepend
- The user enters a string ("baz") which is added in front of every row of the column.
- Input: ["foo", "bar"]
- Result: ["bazfoo", "bazbar"]
#### Append
- The user enters a string ("baz") which is added behind every row of the column.
- Input: ["foo", "bar"]
- Result: ["foobaz", "barbaz"]
#### Substring
- The user enters the zero-based indices of the first and last character (both inclusive) of the substring to be selected, e.g. (1, 2)
- Input: ["foo", "bar"]
- Result: ["oo", "ar"]
#### Replace substring with other string
- The user enters a substring ("oo") which is replaced by another entered substring ("aa")
- Input: ["foo", "bar"]
- Result: ["faa", "bar"]
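
The four basic operations above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the column is modelled as a plain list of strings, and the substring indices are assumed to be zero-based and inclusive, as the examples suggest.

```python
def prepend(rows, prefix):
    # Add the prefix in front of every row.
    return [prefix + row for row in rows]


def append(rows, suffix):
    # Add the suffix behind every row.
    return [row + suffix for row in rows]


def substring(rows, first, last):
    # Assumption: both indices are zero-based and inclusive,
    # so the Python slice has to end at last + 1.
    return [row[first:last + 1] for row in rows]


def replace_substring(rows, old, new):
    # Replace every occurrence of `old` with `new` in every row.
    return [row.replace(old, new) for row in rows]


text = ["foo", "bar"]
print(prepend(text, "baz"))
print(append(text, "baz"))
print(substring(text, 1, 2))
print(replace_substring(text, "oo", "aa"))
```

Each function returns a new list, so the operations can be chained without changing the original column.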
### Clean Strings
#### Define and Enforce Formats
- The user enters a pattern which every entry of the column should follow
- The user can then specify what should happen to the entries that do not follow the pattern (e.g. they are replaced with a default string)
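
A minimal sketch of enforcing a format, assuming the pattern is given as a regular expression that has to match the whole entry. The function name `enforce_format`, the sample zip codes and the default value are hypothetical and not taken from Trifacta.

```python
import re


def enforce_format(rows, pattern, default):
    # Assumption: an entry is valid only if the whole string matches.
    compiled = re.compile(pattern)
    return [row if compiled.fullmatch(row) else default for row in rows]


zip_codes = ["12345", "1234a", "987654"]
print(enforce_format(zip_codes, r"\d{5}", "00000"))
```

Using `fullmatch` instead of `search` keeps entries such as "987654" from passing just because they contain five digits somewhere.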
#### Remove from String
- The user selects what should be removed from the string, this could be:
* a specific substring, where either only the first or all occurrences are removed
* leading/trailing whitespace
* first/last word(s)
* all whitespace in the string
* double spaces which are replaced by a single space
* symbols (everything but letters and numbers)
* non-ASCII letters (e.g. ä is replaced by a)
* quotes surrounding the string
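
The removal variants listed above could look roughly like this for a single string. All function names are hypothetical. The sketch assumes that words are separated by whitespace, that "symbols" means every character that is neither a letter nor a digit (so whitespace is dropped as well), and that special letters are handled by Unicode decomposition, which covers accented letters like ä but not letters such as ß.

```python
import re
import unicodedata


def remove_substring(s, sub, all_occurrences=True):
    # A count of 1 removes only the first occurrence.
    return s.replace(sub, "") if all_occurrences else s.replace(sub, "", 1)


def trim(s):
    # str.strip() removes leading/trailing spaces, tabs and newlines.
    return s.strip()


def remove_first_words(s, count=1):
    # Assumption: words are whitespace-separated; runs collapse to one space.
    return " ".join(s.split()[count:])


def remove_last_words(s, count=1):
    words = s.split()
    return " ".join(words[:max(len(words) - count, 0)])


def remove_all_whitespace(s):
    return re.sub(r"\s+", "", s)


def collapse_double_spaces(s):
    # Two or more consecutive spaces become a single space.
    return re.sub(r" {2,}", " ", s)


def remove_symbols(s):
    # Literal reading of the list: keep only letters and digits.
    return "".join(ch for ch in s if ch.isalnum())


def strip_accents(s):
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_quotes(s):
    # Remove one pair of matching quotes around the string, if present.
    if len(s) >= 2 and s[0] == s[-1] and s[0] in "\"'":
        return s[1:-1]
    return s


print(strip_quotes('"foo"'), strip_accents("ä"), trim("  foo\t"))
```

Most of these are one-line regex or string replacements, which fits the idea that a small set of atomic operations could cover them.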
### Standardize case
#### To Lower
- Every uppercase letter is replaced by its lowercase counterpart
- Example: "fOoBar$" -> "foobar$"
#### To Upper
- Every lowercase letter is replaced by its uppercase counterpart
- Example: "fOoBar$" -> "FOOBAR$"
#### Break out Special Case
- The user selects a special case convention (e.g. camelCase). Every entry is then split into its words according to the selected convention.
- Example: "fooBarBaz" -> "foo bar baz"
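
A small sketch of the three case operations. The function names are hypothetical, and the camelCase rule used here (a word boundary sits wherever an uppercase letter follows a lowercase letter or digit) is an assumption; acronyms such as "HTTPServer" would need a more careful rule.

```python
import re


def to_lower(s):
    return s.lower()


def to_upper(s):
    return s.upper()


def break_out_camel_case(s):
    # Insert a space at every lowercase/digit -> uppercase boundary,
    # then lowercase the result.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", s).lower()


print(to_lower("fOoBar$"))
print(to_upper("fOoBar$"))
print(break_out_camel_case("fooBarBaz"))
```

Non-letter characters such as "$" are left untouched by both case conversions.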
### Standardize String Lengths
#### Pad string values
- The user specifies a number and a character. This character is added to the front/back of every entry until it has the specified length.
- Pad("foo", ' ', 5) -> "  foo"
#### Fix the Length of Strings
- The user specifies a length. Characters are removed from the back until every entry is no longer than the specified length.
- FixLength("foofoofoo", 7) -> "foofoof"
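
Both length operations map directly onto built-in Python string features. In this sketch the names `pad` and `fix_length` and the `side` parameter are hypothetical. Padding "foo" to length 5 adds two fill characters, and entries that are already long enough stay unchanged.

```python
def pad(s, fill, length, side="left"):
    # Assumption: the fill value is exactly one character.
    if len(fill) != 1:
        raise ValueError("fill must be a single character")
    # rjust pads at the front, ljust pads at the back.
    return s.rjust(length, fill) if side == "left" else s.ljust(length, fill)


def fix_length(s, length):
    # Drop characters from the back; shorter strings stay as they are.
    return s[:length]


print(repr(pad("foo", " ", 5)))
print(repr(pad("foo", "_", 5, side="right")))
print(fix_length("foofoofoo", 7))
```

Applying `fix_length` first and `pad` afterwards would give every entry exactly the specified length.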
## Atomic operations
- Nearly all of these scenarios can be covered by a regex engine
- Replace regex
- Prepend
- Append
- Substring
- Matches Pattern
- Replace If Matches/Not Matches
- ToLower
- ToUpper
- Pad
- Note: text in bold tries to suggest "atomic operation"
- Fill out
\ No newline at end of file