Getting The Title From Html Files

This example program prints the titles of all HTML files passed to it on the command line. It is a rough example of extracting data from HTML files with Lua string patterns. It is not necessarily robust, though: it only examines the first 8KB of each file, and it does not understand HTML comments (e.g. consider the rare case of a comment line containing <!-- <title>ack</title> -->).

Usage example (from the shell):


$ ls *.html
cgi.html          htaccess.html  mod_include.html   urlmapping.html
configuring.html  mod_auth.html  mod_rewrite.html
core.html         mod_cgi.html   rewriteguide.html
$ ./title.lua *.html
cgi.html: Apache Tutorial: Dynamic Content with CGI
configuring.html: Configuration Files
core.html: Apache Core Features
htaccess.html: Apache Tutorial: .htaccess files
mod_auth.html: Apache module mod_auth
mod_cgi.html: Apache module mod_cgi
mod_include.html: Apache module mod_include
mod_rewrite.html: Apache module mod_rewrite
rewriteguide.html: Apache 1.3 URL Rewriting Guide
urlmapping.html: Mapping URLs to Filesystem Locations - Apache HTTP Server

Below is the Lua program title.lua:

#!/usr/bin/env lua

-- Return the title of the HTML file fname, "" if no title was found,
-- or false if the file could not be opened.
local function getTitle(fname)
  local fp = io.open(fname, "r")
  if fp == nil then
    return false
  end

  -- Read up to 8KB (avoid problems when trying to parse /dev/urandom)
  -- read() returns nil at end of file (e.g. an empty file), so fall back to "".
  local s = fp:read(8192) or ""
  fp:close()

  -- Remove optional spaces from the tags.
  s = string.gsub(s, "\n", " ")
  s = string.gsub(s, " *< *", "<")
  s = string.gsub(s, " *> *", ">")

  -- Put all the tags in lowercase.
  s = string.gsub(s, "(<[^ >]+)", string.lower)

  -- ".-" is non-greedy, so the match stops at the first </title>.
  local _, _, t = string.find(s, "<title>(.-)</title>")
  return t or ""
end

if arg[1] == nil then
  print("Usage: lua " .. arg[0] .. " <filename> [...]")
  os.exit(1)
end

local i = 1
while arg[i] do
  local t = getTitle(arg[i])
  if t then
    print(arg[i] .. ": " .. t)
  else
    print(arg[i] .. ": File opening error.")
  end
  i = i + 1
end

os.exit(0)

-- AlexandreErwinIttner
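
The comment caveat mentioned in the introduction can be reduced by deleting HTML comments before searching for the title. The following is only a sketch under stated assumptions, not part of title.lua: titleFromString is a hypothetical helper name, it works on a string that has already been read into memory, and it assumes comments are well formed (each <!-- has a closing -->). It is still pattern matching rather than real HTML parsing.

```lua
-- Hypothetical helper (not part of title.lua): extract the title from an
-- HTML string, ignoring anything inside <!-- ... --> comments.
local function titleFromString(html)
  -- Drop comments first so a commented-out <title> cannot match.
  -- ".-" is the lazy form of ".*", so each match ends at the nearest "-->".
  local s = html:gsub("<!%-%-.-%-%->", "")

  -- Match the tag name in any letter case, allowing spaces inside the
  -- brackets, and capture lazily up to the first closing tag.
  local t = s:match("<%s*[Tt][Ii][Tt][Ll][Ee]%s*>(.-)<%s*/%s*[Tt][Ii][Tt][Ll][Ee]%s*>")
  if not t then
    return ""
  end

  -- Collapse runs of whitespace (including newlines) and trim both ends.
  t = t:gsub("%s+", " "):gsub("^ ", ""):gsub(" $", "")
  return t
end

print(titleFromString("<!-- <title>ack</title> --><title>Real</title>"))  --> Real
```

Unlike getTitle, this variant leaves the input untouched outside the title itself, because the letter-case and spacing variations are handled in the pattern instead of by rewriting the whole buffer.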


Last edited January 2, 2007 1:23 am GMT