Split Join

lua-users home
wiki

"split" [1] and "join" [2] are two common string operators that are essentially inverse operations of each other. Split separates a string containing a delimiter into the list of substrings between that delimiter. Join combines a list of strings into a new string by inserting a delimiter between each string.

There are various ways to design and implement these functions in Lua, as described below.

Joining list of strings

With Lua 5.x you can use table.concat[3] for joining: table.concat(tbl, delimiter_str).

table.concat({"a", "b", "c"}, ",") --> "a,b,c"

Other interfaces are possible, largely dependent on the choice of split interface since join is often intended to be the inverse operation of split.

Splitting Strings

First of all, although Lua does not have a split function is its standard library, it does have string.gmatch[4], which can be used instead of a split function in many cases. Unlike a split function, string.gmatch takes a pattern to match the non-delimiter text, instead of the delimiters themselves:

local example = "an example string"

for i in string.gmatch(example, "%S+") do

  print(i)

end



-- output:

-- an

-- example

-- string

A split[1] function separates a string into a list of substrings, breaking the original string on occurrences of some separator (character, character set, or pattern). There are various ways to design a string split function. A summary of the design decisions is listed below.

Should split return a table array, a list, or an iterator?

split("a,b,c", ",") --> {"a", "b", "c"}

split("a,b,c", ",") --> "a","b","c" (not scalable: Lua has a limit of a few thousand return values)

for x in split("a,b,c", ",") do ..... end

Should the separator be a string, Lua pattern, LPeg pattern, or regular expression?

split("a  +b c", " +") --> {"a ", "b c"}

split("a  +b c", " +") --> {"a", "+b", "c"}

split("a  +b c", some_other_object) --> .....

How should empty separators be handled?

split("abc", "") --> {"a", "b", "c"} 

split("abc", "") --> {"", "a", "b", "c", ""}

split("abc", "") --> error

split("abc", "%d*") --> what about patterns that can evaluate to empty strings?

Note: split(s,"") is a convenient idiom for splitting a string into characters. In Lua, we can alternately do for c in s:gmatch"." do ..... end.

How should empty values be handled?

split(",,a,b,c,", ",") --> {"a", "b", "c"}

split(",,a,b,c,", ",") --> {"", "", "a", "b", "c", ""}

split(",", ",") --> {} or {""} or {"", ""} ?

split("", ",") --> {} or {""} ?

Note: Although splitting and joining are roughly inverses, the operations are not necessarily uniquely determined, particularly when there are empty strings. join({"",""}, ""), join({""}, "") and join({}, "") all result in the same string "". Therefore, the choice of what the inverse operation split("", "") should return is not immediately clear.

Note: completely ignoring empty values can be undesirable, such as for rows in a CSV file where column positions matter. A CSV file with empty rows "" is unclear: is this an column containing an empty value or are there zero columns? A zero column CSV file is unlikely though perhaps not impossible.

Should there be an argument to limit the number of splits?

split("a,b,c", ",", 2) --> {"a", "b,c"}

Should the separator be returned? This is more useful when the separator is a pattern, in which case the separator can vary:

split("a  b c", " +") --> {"a", "  ", "b", " ", "c"}

Note: Note also that string.gmatch [5] is in a way a dual of split, returning the substrings that match a pattern and discarding strings between them rather than the other way around. A function that returns both is sometimes called partition [6].


Method: Split by pattern using string.gsub/string.match

Break a string up at occurrences of a single character. If the number of fields is known:

str:match( ("([^"..sep.."]*)"..sep):rep(nsep) )

If the number of fields is not known

fields = {str:match((str:gsub("[^"..sep.."]*"..sep, "([^"..sep.."]*)"..sep)))}

Some might call the above a hack :) sep will need to be escaped if it is a pattern metacharacter, and you'd probably be better off precomputing and/or memoizing the patterns.


Method: Using only string.gsub

fields = {}

str:gsub("([^"..sep.."]*)"..sep, function(c) table.insert(fields, c) end)

Does not work as expected:

str, sep = "1:2:3", ":"

fields = {}

str:gsub("([^"..sep.."]*)"..sep, function(c) table.insert(fields, c) end)

for i,v in ipairs(fields) do print(i,v) end



-- output:

-- 1        1

-- 2        2

Fix:

function string:split(sep)

        local sep, fields = sep or ":", {}

        local pattern = string.format("([^%s]+)", sep)

        self:gsub(pattern, function(c) fields[#fields+1] = c end)

        return fields

end

Example: split a string into words, or return nil

function justWords(str)

  local t = {}

  local function helper(word) table.insert(t, word) return "" end

  if not str:gsub("%w+", helper):find"%S" then return t end

end


Method: Split a string with a pattern, Take One

This splits a string using the pattern sep. It calls func for each segment. When func is called, the first argument is the segment and the remaining arguments are the captures from sep, if any. On the last segment, func will be called with just one argument. (This could be used as a flag, or you could use two different functions). sep must not match the empty string. Enhancements are left as an exercise :)

func((str:gsub("(.-)("..sep..")", func)))

Example: Split a string into lines separated by either DOS or Unix line endings, creating a table out of the results.

function lines(str)

  local t = {}

  local function helper(line) table.insert(t, line) return "" end

  helper((str:gsub("(.-)\r?\n", helper)))

  return t

end

Function: Split a string with a pattern, Take Two

The problem with using gsub as above is that it can't handle the case when the separator pattern doesn't appear at the end of the string. In that case the final "(.-)" never gets to capture the end of the string, because the overall pattern fails to match. To handle that case you have to do something a little more complicated. The split function below behaves more or less like split in perl or python. In particular, single matches at the beginning and end of the string do not create new elements. Multiple matches in a row create empty string elements.

-- Compatibility: Lua-5.1

function split(str, pat)

   local t = {}  -- NOTE: use {n = 0} in Lua-5.0

   local fpat = "(.-)" .. pat

   local last_end = 1

   local s, e, cap = str:find(fpat, 1)

   while s do

      if s ~= 1 or cap ~= "" then

	 table.insert(t,cap)

      end

      last_end = e+1

      s, e, cap = str:find(fpat, last_end)

   end

   if last_end <= #str then

      cap = str:sub(last_end)

      table.insert(t, cap)

   end

   return t

end

Example: Split a file path string into components.

function split_path(str)

   return split(str,'[\\/]+')

end



parts = split_path("/usr/local/bin")

--> {'usr','local','bin'}

Test Cases:

split('foo/bar/baz/test','/')

--> {'foo','bar','baz','test'}

split('/foo/bar/baz/test','/')

--> {'foo','bar','baz','test'}

split('/foo/bar/baz/test/','/')

--> {'foo','bar','baz','test'}

split('/foo/bar//baz/test///','/')

--> {'foo','bar','','baz','test','',''}

split('//foo////bar/baz///test///','/+')

--> {'foo','bar','baz','test'}

split('foo','/+')

--> {'foo'}

split('','/+')

--> {}

split('foo','')  -- opps! infinite loop!

Function: Split a string with a pattern, Take Three

After a discussion on this topic in the mailing list, I made my own function... I took, unknowingly, a way similar to the function above, except I use gfind to iterate, and I see the single matches at beginning and end of string as empty fields. As above, multiple successive delimiters create empty string elements.

-- Compatibility: Lua-5.0

function Split(str, delim, maxNb)

    -- Eliminate bad cases...

    if string.find(str, delim) == nil then

        return { str }

    end

    if maxNb == nil or maxNb < 1 then

        maxNb = 0    -- No limit

    end

    local result = {}

    local pat = "(.-)" .. delim .. "()"

    local nb = 0

    local lastPos

    for part, pos in string.gfind(str, pat) do

        nb = nb + 1

        result[nb] = part

        lastPos = pos

        if nb == maxNb then break end

    end

    -- Handle the last field

    if nb ~= maxNb then

        result[nb + 1] = string.sub(str, lastPos)

    end

    return result

end

Test Cases:

ShowSplit("abc", '')

--> { [1] = "", [2] = "", [3] = "", [4] = "", [5] = "" }

-- No infite loop... but garbage in, garbage out...

ShowSplit("", ',')

--> { [1] = "" }

ShowSplit("abc", ',')

--> { [1] = "abc" }

ShowSplit("a,b,c", ',')

--> { [1] = "a", [2] = "b", [3] = "c" }

ShowSplit("a,b,c,", ',')

--> { [1] = "a", [2] = "b", [3] = "c", [4] = "" }

ShowSplit(",a,b,c,", ',')

--> { [1] = "", [2] = "a", [3] = "b", [4] = "c", [5] = "" }

ShowSplit("x,,,y", ',')

--> { [1] = "x", [2] = "", [3] = "", [4] = "y" }

ShowSplit(",,,", ',')

--> { [1] = "", [2] = "", [3] = "", [4] = "" }

ShowSplit("x!yy!zzz!@", '!', 4)

--> { [1] = "x", [2] = "yy", [3] = "zzz", [4] = "@" }

ShowSplit("x!yy!zzz!@", '!', 3)

--> { [1] = "x", [2] = "yy", [3] = "zzz" }

ShowSplit("x!yy!zzz!@", '!', 1)

--> { [1] = "x" }



ShowSplit("a:b:i:p:u:random:garbage", ":", 5)

--> { [1] = "a", [2] = "b", [3] = "i", [4] = "p", [5] = "u" }

ShowSplit("hr , br ;  p ,span, div", '%s*[;,]%s*')

--> { [1] = "hr", [2] = "br", [3] = "p", [4] = "span", [5] = "div" }

(PhilippeLhoste)


Function: Perl-like split/join

Many people miss Perl-like split/join functions in Lua. Here are mine:

-- Concat the contents of the parameter list,

-- separated by the string delimiter (just like in perl)

-- example: strjoin(", ", {"Anna", "Bob", "Charlie", "Dolores"})

function strjoin(delimiter, list)

  local len = getn(list)

  if len == 0 then 

    return "" 

  end

  local string = list[1]

  for i = 2, len do 

    string = string .. delimiter .. list[i] 

  end

  return string

end



-- Split text into a list consisting of the strings in text,

-- separated by strings matching delimiter (which may be a pattern). 

-- example: strsplit(",%s*", "Anna, Bob, Charlie,Dolores")

function strsplit(delimiter, text)

  local list = {}

  local pos = 1

  if strfind("", delimiter, 1) then -- this would result in endless loops

    error("delimiter matches empty string!")

  end

  while 1 do

    local first, last = strfind(text, delimiter, pos)

    if first then -- found?

      tinsert(list, strsub(text, pos, first-1))

      pos = last+1

    else

      tinsert(list, strsub(text, pos))

      break

    end

  end

  return list

end

(PeterPrade)


Function: Perl-like split/join, alternative

Here's my own split function, for comparison. It's largely the same as the above; not quite as DRY but (IMO) slightly cleaner. It doesn't use gfind (as suggested below) because I wanted to be able to specify a pattern for the split string, not a pattern for the data sections. If speed is paramount, it might be made faster by caching string.find as a local 'strfind' variable, as the above does.

--Written for 5.0; could be made slightly cleaner with 5.1

--Splits a string based on a separator string or pattern;

--returns an array of pieces of the string.

--(May optionally supply a table as the third parameter which will be filled 

with the results.)

function string:split( inSplitPattern, outResults )

  if not outResults then

    outResults = { }

  end

  local theStart = 1

  local theSplitStart, theSplitEnd = string.find( self, inSplitPattern, 

theStart )

  while theSplitStart do

    table.insert( outResults, string.sub( self, theStart, theSplitStart-1 ) )

    theStart = theSplitEnd + 1

    theSplitStart, theSplitEnd = string.find( self, inSplitPattern, theStart )

  end

  table.insert( outResults, string.sub( self, theStart ) )

  return outResults

end

(GavinKistner)


Function: PHP-like explode

Explode string into table with seperator (moved from TableUtils):

-- explode(seperator, string)

function explode(d,p)

  local t, ll

  t={}

  ll=0

  if(#p == 1) then return {p} end

    while true do

      l=string.find(p,d,ll,true) -- find the next d in the string

      if l~=nil then -- if "not not" found then..

        table.insert(t, string.sub(p,ll,l-1)) -- Save it in our array.

        ll=l+1 -- save just after where we found it for searching next time.

      else

        table.insert(t, string.sub(p,ll)) -- Save what's left in our array.

        break -- Break at end, as it should be, according to the lua manual.

      end

    end

  return t

end

Here's my version of PHP style explode, supporting limit

function explode(sep, str, limit)

    if not sep or sep == "" then return false end

    if not str then return false end

    limit = limit or mhuge

    if limit == 0 or limit == 1 then return {str},1 end



    local r = {}

    local n, init = 0, 1



    while true do

        local s,e = strfind(str, sep, init, true)

        if not s then break end

        r[#r+1] = strsub(str, init, s - 1)

        init = e + 1

        n = n + 1

        if n == limit - 1 then break end

    end



    if init <= strlen(str) then

        r[#r+1] = strsub(str, init)

    else

        r[#r+1] = ""

    end

    n = n + 1



    if limit < 0 then

        for i=n, n + limit + 1, -1 do r[i] = nil end

        n = n + limit

    end



    return r, n

end

(Lance Li)


Function: Using metatable and __index

This function uses a metatable's __index function to populate the table of split parts. This function does not try to (correctly) invert the pattern, and so really doesn't work as most string split functions do.

--[[ written for Lua 5.1

split a string by a pattern, take care to create the "inverse" pattern 

yourself. default pattern splits by white space.

]]

string.split = function(str, pattern)

  pattern = pattern or "[^%s]+"

  if pattern:len() == 0 then pattern = "[^%s]+" end

  local parts = {__index = table.insert}

  setmetatable(parts, parts)

  str:gsub(pattern, parts)

  setmetatable(parts, nil)

  parts.__index = nil

  return parts

end

-- example 1

str = "no separators in this string"

parts = str:split( "[^,]+" )

print( # parts )

table.foreach(parts, print)

--[[ output:

1

1	no separators in this string

]]



-- example 2

str = "   split, comma, separated  , , string   "

parts = str:split( "[^,%s]+" )

print( # parts )

table.foreach(parts, print)

--[[ output:

4

1	split

2	comma

3	separated

4	string

]]


Function: true Python semantics for split

This is the Python behavior:

Python 2.5.1 (r251:54863, Jun 15 2008, 18:24:51) 

[GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2

>>> 'x!yy!zzz!@'.split('!')

['x', 'yy', 'zzz', '@']

>>> 'x!yy!zzz!@'.split('!', 3)

['x', 'yy', 'zzz', '@']

>>> 'x!yy!zzz!@'.split('!', 2)

['x', 'yy', 'zzz!@']

>>> 'x!yy!zzz!@'.split('!', 1)

['x', 'yy!zzz!@']

And IMHO this Lua function implements this semantics:

function string:split(sSeparator, nMax, bRegexp)

	assert(sSeparator ~= '')

	assert(nMax == nil or nMax >= 1)



	local aRecord = {}



	if self:len() > 0 then

		local bPlain = not bRegexp

		nMax = nMax or -1



		local nField=1 nStart=1

		local nFirst,nLast = self:find(sSeparator, nStart, bPlain)

		while nFirst and nMax ~= 0 do

			aRecord[nField] = self:sub(nStart, nFirst-1)

			nField = nField+1

			nStart = nLast+1

			nFirst,nLast = self:find(sSeparator, nStart, bPlain)

			nMax = nMax-1

		end

		aRecord[nField] = self:sub(nStart)

	end



	return aRecord

end

Observe the possibility to use simple strings or regular expressions as delimiters.

Test Cases:

Lua 5.1.4  Copyright (C) 1994-2008 Lua.org, PUC-Rio

...

> for k,v in next, string.split('x!yy!zzz!@', '!') do print(v) end

x

yy

zzz

@

> for k,v in next, string.split('x!yy!zzz!@', '!', 3) do print(v) end

x

yy

zzz

@

> for k,v in next, string.split('x!yy!zzz!@', '!', 2) do print(v) end

x

yy

zzz!@

> for k,v in next, string.split('x!yy!zzz!@', '!', 1) do print(v) end

x

yy!zzz!@

(JoanOrdinas)


If we define split simply as "return all 0-n character occurences followed by a separator, plus what's left of the string", which I think results in the most intuitive splitting logic, then we get a simple implementation using just gmatch which covers all cases and still allows the separator to be a pattern:

function gsplit(s,sep)

	return coroutine.wrap(function()

		if s == '' or sep == '' then coroutine.yield(s) return end

		local lasti = 1

		for v,i in s:gmatch('(.-)'..sep..'()') do

		   coroutine.yield(v)

		   lasti = i

		end

		coroutine.yield(s:sub(lasti))

	end)

end



--same idea without coroutines



function gsplit2(s,sep)

	local lasti, done, g = 1, false, s:gmatch('(.-)'..sep..'()')

	return function()

		if done then return end

		local v,i = g()

		if s == '' or sep == '' then done = true return s end

		if v == nil then done = true return s:sub(lasti) end

		lasti = i

		return v

	end

end



The gsplit() above returns an iterator, so other API variants can be easily derived from it:



{{{!Lua

function iunpack(i,s,v1)

   local function pass(...)

	  local v1 = i(s,v1)

	  if v1 == nil then return ... end

	  return v1, pass(...)

   end

   return pass()

end



function split(s,sep)

   return iunpack(gsplit(s,sep))

end



function accumulate(t,i,s,v)

    for v in i,s,v do

        t[#t+1] = v

    end

    return t

end



function tsplit(s,sep)

   return accumulate({}, gsplit(s,sep))

end

Note that the above implementation does not allow captures in the separator. To allow for that, another closure must be created to pass along the additional captured strings (see VarargTheSecondClassCitizen). The semantics also get muddy (I suppose one use case would be wanting to know what the actual separator was for each string, eg. for a separator pattern like [%.,;]).

function gsplit(s,sep)

	local i, done, g = 1, false, s:gmatch('(.-)'..sep..'()')

	local function pass(...)

		if ... == nil then done = true return s:sub(i) end

		i = select(select('#',...),...)

		return ...

	end

	return function()

		if done then return end

		if s == '' or sep == '' then done = true return s end

		return pass(g())

	end

end

The problem with the above implementation is that however easy to read, the (.-) pattern in Lua has awful performance, hence the following implementation based on only string.find (allows for captures in the separator and adds a third argument "plain", similar to string.find):

function string.gsplit(s, sep, plain)

	local start = 1

	local done = false

	local function pass(i, j, ...)

		if i then

			local seg = s:sub(start, i - 1)

			start = j + 1

			return seg, ...

		else

			done = true

			return s:sub(start)

		end

	end

	return function()

		if done then return end

		if sep == '' then done = true return s end

		return pass(s:find(sep, start, plain))

	end

end

Unit testing:


local function test(s,sep,expect)

	local t={} for c in s:gsplit(sep) do table.insert(t,c) end

	assert(#t == #expect)

	for i=1,#t do assert(t[i] == expect[i]) end

	test(t, expect)

end

test('','',{''})

test('','asdf',{''})

test('asdf','',{'asdf'})

test('', ',', {''})

test(',', ',', {'',''})

test('a', ',', {'a'})

test('a,b', ',', {'a','b'})

test('a,b,', ',', {'a','b',''})

test(',a,b', ',', {'','a','b'})

test(',a,b,', ',', {'','a','b',''})

test(',a,,b,', ',', {'','a','','b',''})

test('a,,b', ',', {'a','','b'})

test('asd  ,   fgh  ,;  qwe, rty.   ,jkl', '%s*[,.;]%s*', {'asd','fgh','','qwe','rty','','jkl'})

test('Spam eggs spam spam and ham', 'spam', {'Spam eggs ',' ',' and ham'})

(CosminApreutesei)





-- single char string splitter, sep *must* be a single char pattern

-- *probably* escaped with % if it has any special pattern meaning, eg "%." not "."

-- so good for splitting paths on "/" or "%." which is a common need



local function csplit(str,sep)

        local ret={}

        local n=1

        for w in str:gmatch("([^"..sep.."]*)") do

                        ret[n]=ret[n] or w -- only set once (so the blank after a string is ignored)

                        if w=="" then n=n+1 end -- step forwards on a blank but not a string

        end

        return ret

end



-- the following is true of any string, csplit will do the reverse of a concat



local str=""

print(str , assert( table.concat( csplit(str,"/") , "/" ) == str ) )



local str="only"

print(str , assert( table.concat( csplit(str,"/") , "/" ) == str ) )



local str="/test//ok/"

print(str , assert( table.concat( csplit(str,"/") , "/" ) == str ) )



local str=".test..ok."

print(str , assert( table.concat( csplit(str,"%.") , "." ) == str ) )




Comparison to Other Languages

User Comments

I mean no disrespect, of course, but.. does anyone actually have a working split function without glitches like infinite loops, wrong matches, or error cases? Are all those "takes" of any help here? -- CosminApreutesei

Try Rici Lake's split function: LuaList:2006-12/msg00414.html -- Jörg Richter

That version fails again when the pattern is an empty string. Specifications for split functions in other languages define how these corner cases should behave (see "Comparison to Other Languages" above). --David Manura

See Also


RecentChanges · preferences
edit · history
Last edited May 11, 2013 8:10 pm GMT (diff)