Are you one of the many Excel users out there having to deal with text fields in unusable formats? If so this guide will help you gain an understanding of how to use Excel’s SEARCH, RIGHT, LEFT and LEN functions to convert that text into meaningful data…
Data comes to us in many forms and formats and if you have been in the analyst game for long enough you will have seen many different formats for data, some of which are less than user-friendly (and that’s being polite!)
A common problem I come across in my consulting is text fields that contain multiple pieces of information. Combined in one text field they are fairly useless to an analyst, but broken out correctly we can actually use them in reporting or for further analysis.
To help you understand the basics of how to approach here is a guide which I hope you find useful.
A manager has asked for a sales report by staff name and by region for the Supplier, ZZZ Corp. You contact ZZZ Corp and request the data, which comes to you looking like this:
Now that’s not very useful of ZZZ Corp is it!
The “Region and Name Key” have been combined for some reason unbeknown to us but if we can split those apart we will have all the data we need to complete our report.
The first step in any problem like this is to analyse the structure of the text field. In this case we can see the region is listed first, then an underscore (_), which separates the region from the name. The underscore is critical to us because we can see that for region we want to capture all the text before the underscore, and for name we want to capture all the text after the underscore. The problem is the underscore varies in position depending on how long the region text is so we need to use an Excel function that can take that into account and find the right position for the underscore regardless of the region text.
The SEARCH function takes the format:
SEARCH(find_text, within_text, [start number])
The result of the function is the position of the text we are “searching” for in the cell we specify so this will be perfect to solve our problem and find the underscore every time.
Let’s see it in action:
As the above shows we search for the underscore (our find text part) in the relevant cell (our within text part) and as we are starting from the beginning of the text we can ignore the optional [Start number] portion of the formula. The result gives us the number of characters, from the start of the text field until we reach the underscore. So cell A2 shows “London_Dave Smith” and counting along we can see that the underscore is the 7th character from the start:
Now we know how to find the position of the underscore we can develop our formula to capture the region and the name from the text field.
Region is at the start of our text field and is always all of the text before the underscore so we want to capture everything up to the underscore. To do this we need to use Excel LEFT Function, left because it is working from the left, or start, of the text field.
The LEFT function takes the format:
In our example the text part is our cell containing the text, and the [num_chars] part is how many characters we need to count from the left in order to capture the region text. We know we can’t just enter a number here because it varies for each record, however we can combine with the SEARCH function to tell Excel exactly how many characters to count each time.
Let’s combine the two and apply the formula to our data:
As you can see we start the LEFT function as normal but in place of the number of characters we use the SEARCH formula we created earlier. It’s important to note that we also have a -1 on the end of our formula. This is because the result of the SEARCH formula is the character position of the underscore, if we omit the -1 part so the formula in C2 read =LEFT(A2, SEARCH(“_”,A2)) then the result would be “London_”. We don’t want the underscore in our region text so we simply tell Excel to count to the underscore, then minus one so we don’t actually capture it.
That’s the region part sorted.
The name is everything to the right of the underscore, so your first thought might be to apply the RIGHT function in the same way as the LEFT. The RIGHT function takes the same format as the LEFT function:
But as you would expect it counts from the right, or the end, of the text. That poses us a small problem as the results of our SEARCH function, which tells us where the underscore is, has counted from the left and there is no way to make it count from the right (you can create your own VBA script to that but that’s for another day!)
Let’s try implementing the RIGHT function in the same way as we did for the LEFT function and you can see the problem:
The results are not what we wanted.
Looking at cell A2 the formula is working out where the underscore is (character position 7) and counting 7 from the right, hence we cut off part of the staff name.
In order to count the correct number of characters required from the right we’ll need to do a little maths.
- We can see above that the total length of the example text field is 17 characters
- We know the position of the Underscore when counting from the left is the 7th character
- We know we don’t want anything to the left of the 7th character; we want everything to the right.
Therefore 17 characters in length, minus the 7 characters we don’t want to include will leave us aiming to capture 10 characters from the right, or end of the text, in order to get the full name:
As we can find the underscore position with our SEARCH formula created earlier the only piece of the equation missing is the length of the text in the cell, and Excel has a handy function for that – the LEN function. The LEN function takes the format:
Where “text” is the cell we want to count the length of.
The result of the LEN function will tell us the entire length of the text in the cell, in characters, so we can combine this with the results of the SEARCH function to replicate the maths we did above.
= LEN(A2) gives us 17
= SEARCH(“_”, A2) gives us 7
= LEN(A2) – SEARCH(“_”, A2) gives us 10
Now we are getting close. Let’s replace that old formula which didn’t work and build this new logic into our RIGHT function so we are counting the correct number of characters, from the right, to capture the entire name:
And there we have it; the full name is selected each time.
We can now take that fairly useless cell of text, split out the region and split out the name so we can complete a meaningful report for our manager showing sales by Region or sales by Name.
The concepts in this guide will be useful for approaching any text that needs to be segmented, not just in the format we have used in the example. I have seen many crazy looking text fields in my time so if you have some of your own why not try and implement these concepts and get that data into a useful format. If you are really having difficulties with a complicated one then post it in a comment below and we’ll get to work solving it with you.