punkshell_module_punk::fileline(0) 0.1.0 doc "punk fileline"

Name

punkshell_module_punk::fileline - file line-handling utilities

Table Of Contents

Synopsis

Description

Overview

Utilities for in-memory analysis of text file data as both line data and byte/char-counted data, whilst preserving the line-endings (even if mixed).

This matters for text files whose handling depends on exact character/byte counts.

For example, Windows .cmd/.bat files need some byte counting to determine whether labels lie on chunk boundaries and need to be moved.

This chunk-size counting depends on the character encoding.

Despite including the word 'file', the library doesn't necessarily read from or write to the filesystem.

The raw data can be supplied as a string, or loaded from a file using punk::fileline::get_textinfo -file <filename>

Concepts

A chunk of textfile data (possibly representing a whole file - but usually at least a complete set of lines) is loaded into a punk::fileline::class::textinfo instance at object creation.

    package require punk::fileline
    package require fileutil
    set rawdata [fileutil::cat -translation binary data.txt]
    punk::fileline::class::textinfo create obj_data $rawdata
    puts stdout [obj_data linecount]

Notes

Line records are referred to by a zero-based index instead of a one-based index as is commonly used when displaying files.

This is for programming consistency and convenience, and the module user should do their own conversion to one-based indexing for line display or messaging if desired.

Lone carriage-returns are not interpreted as line-endings.

CR line-endings that are intended to be interpreted as such should be mapped to something else before the data is supplied to this module.
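
One way to do that mapping before handing data to the module is sketched below. The proc name cr_to_lf is hypothetical and not part of punk::fileline, and mapping to LF is only one possible choice of replacement.

```tcl
# Hypothetical helper (not part of punk::fileline): convert lone CR to LF,
# leaving CRLF pairs and existing LFs untouched.
proc cr_to_lf {data} {
    # string map tries the keys in list order at each position, so a CRLF
    # pair is consumed (and mapped to itself) before the lone-CR rule applies.
    return [string map [list \r\n \r\n \r \n] $data]
}

set raw "one\rtwo\r\nthree\n"
set fixed [cr_to_lf $raw]
# $fixed now holds "one\ntwo\r\nthree\n"
```

The result can then be passed as the datachunk when creating a textinfo object.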

dependencies

packages needed by punk::fileline

  • Tcl 8.6-

  • punk::args

optional dependencies

packages that add functionality but aren't strictly required

  • punk::ansi

    - recommended for class::textinfo chunk_boundary_display

  • punk::char

    - recommended for class::textinfo chunk_boundary_display

  • overtype

    - recommended for class::textinfo chunk_boundary_display

API

Namespace punk::fileline::class

class definitions

  1. CLASS textinfo

    METHODS

    class::textinfo constructor datachunk ?option value...?

    Constructor for textinfo object which represents a chunk or all of a file

    datachunk should be passed with the file data including line-endings as-is for full functionality, i.e. when loading the data use something like:

        set fd [open <filename> r]
        fconfigure $fd -translation binary
        set chunkdata [read $fd]
        close $fd
    or
        set chunkdata [fileutil::cat -translation binary <filename>]

    class::textinfo chunk chunkstart chunkend

    Return a range of bytes from the underlying raw chunk data.

    e.g. the following retrieves the entire chunk:

    objName chunk 0 end

    class::textinfo chunklen

    Number of bytes/characters in the raw data of the file

    class::textinfo chunk_boundary_display

    Returns a string displaying the boundaries at chunksize bytes between chunkstart and chunkend

    Defaults to using ansi colour if punk::ansi module is available. Use -ansi 0 to disable colour

    class::textinfo linecount

    Number of lines in the raw data of the file, counted as per the policy in effect

    class::textinfo line lineindex

    Reconstructs and returns the raw line using the payload and per-line stored line-ending metadata

    A 'line' may be returned without a line-ending if the underlying chunk had trailing data without a line-ending (or the chunk was loaded under a non-standard -policy setting)

    Whilst such data may not conform to definitions (e.g. POSIX) of the terms 'textfile' and 'line' - it is useful here to represent it as a line with metadata le set to "none"

    To return just the data which might more commonly be needed for dealing with lines, use the linepayload method - which returns the line data minus line-ending

    class::textinfo linepayload_find_glob globsearch ?option value...?

    Return a lineinfolist (see lineinfo and lineinfolist) of lines where payload matches the globsearch string

    To limit the returned results use the -limit n option - where -limit 0 means return all matches.

    For example: linepayload_find_glob "*test*" -limit 1

    The result is always a list of lineinfo dictionaries even if one item is returned

    -limitfrom can be start|end

    The order of results is always the order as they occur in the data - even if -limitfrom end is specified.

    -limitfrom end means that only the last -limit items are returned
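
    The -limit and -limitfrom rules above can be illustrated with a standalone sketch over a plain list of payloads. The proc find_glob_sketch is hypothetical; it returns simple {index payload} pairs rather than the lineinfo dicts the real method returns, and it is not the module's implementation.

```tcl
# Hypothetical stand-in for the -limit/-limitfrom behaviour described above.
# payloads: list of line payloads. Returns a list of {lineindex payload} pairs.
proc find_glob_sketch {payloads pattern {limit 0} {limitfrom start}} {
    set matches {}
    set idx 0
    foreach p $payloads {
        if {[string match $pattern $p]} {
            lappend matches [list $idx $p]
        }
        incr idx
    }
    # limit 0 means "return all matches"
    if {$limit > 0 && [llength $matches] > $limit} {
        if {$limitfrom eq "end"} {
            # keep the last $limit matches, still in data order
            set matches [lrange $matches end-[expr {$limit - 1}] end]
        } else {
            set matches [lrange $matches 0 [expr {$limit - 1}]]
        }
    }
    return $matches
}

set payloads [list a-test b test2 c xtestx]
puts [find_glob_sketch $payloads *test* 2 end]   ;# {2 test2} {4 xtestx}
```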

    Note that as glob accepts [chars] to mean match any character in the set given by chars, searching for literal square brackets should be done by escaping the bracket with a backslash.

    This is true even if only a single square bracket is being searched for, e.g. {*[file*} will not find a left square-bracket followed by the word file - even though the search didn't close the square brackets.

    In the above case - the literal search should be {*\[file*}
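
    When the search text comes from user input, the escaping can be done programmatically. The following is a minimal sketch; glob_escape is a hypothetical helper, not something provided by punk::fileline.

```tcl
# Hypothetical helper: backslash-escape the characters that are special
# to glob-style matching (\ * ? [ ]) so the text is matched literally.
proc glob_escape {text} {
    return [string map [list \\ \\\\ * \\* ? \\? \[ \\\[ \] \\\]] $text]
}

# Build a "contains this literal text" pattern.
set pattern "*[glob_escape {[file}]*"
puts $pattern                                            ;# *\[file*
puts [string match $pattern {set fd [file join a b]}]    ;# 1
```

    The resulting pattern could then be supplied as the globsearch argument to linepayload_find_glob.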

    class::textinfo linepayload lineindex

    Return the text of the line indicated by the zero-based lineindex

    The line-ending is not returned in the data - but is still stored against this lineindex

    Line Metadata such as the line-ending for a particular line and the byte/character range it occupies within the chunk can be retrieved with the linemeta method

    To retrieve both the line text and metadata in a single call the lineinfo method can be used

    To retrieve an entire line including line-ending use the line method.

    class::textinfo linepayloads startindex endindex

    Return a list of just the payloads in the specified lineindex range, with no metadata.

    class::textinfo linemeta lineindex

    Return a dict of the metadata for the line indicated by the zero-based lineindex

    Keys returned include

    • le

      A string representing the type of line-ending: crlf|lf|none

    • linelen

      The number of characters/bytes in the whole line including line-ending if any

    • payloadlen

      The number of characters/bytes in the line excluding line-ending

    • start

      The zero-based index into the associated raw file data indicating at which byte/character index this line begins

    • end

      The zero-based index into the associated raw file data indicating at which byte/character index this line ends

      This end-point corresponds to the last character of the line-ending if any - not necessarily the last character of the line's payload
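
    To make the relationship between these keys concrete, here is a standalone sketch that derives the same five keys from a raw string. The proc sketch_linemeta is hypothetical and is not the module's implementation; in particular, how the real class treats empty input or non-default -policy settings is not modelled.

```tcl
# Standalone sketch: split raw text into per-line metadata dicts with the
# keys le, linelen, payloadlen, start and end. Lone CR is not a line-ending.
proc sketch_linemeta {raw} {
    set result {}
    set start 0
    set total [string length $raw]
    while {$start < $total} {
        set lfpos [string first \n $raw $start]
        if {$lfpos < 0} {
            # trailing data with no line-ending
            set le none
            set end [expr {$total - 1}]
            set payloadlen [expr {$total - $start}]
        } elseif {$lfpos > $start && [string index $raw [expr {$lfpos - 1}]] eq "\r"} {
            set le crlf
            set end $lfpos
            set payloadlen [expr {$lfpos - 1 - $start}]
        } else {
            set le lf
            set end $lfpos
            set payloadlen [expr {$lfpos - $start}]
        }
        # end is the index of the last character of the line-ending (if any)
        lappend result [dict create le $le linelen [expr {$end - $start + 1}] \
            payloadlen $payloadlen start $start end $end]
        set start [expr {$end + 1}]
    }
    return $result
}

foreach meta [sketch_linemeta "ab\r\ncd\nef"] {
    puts $meta
}
```

    For the three lines in this input, start/end come out as 0/3, 4/6 and 7/8, with le values crlf, lf and none.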

    class::textinfo lineinfo lineindex

    Return a dict of the metadata and text for the line indicated by the zero-based lineindex

    This returns the same info as the linemeta with an added key of 'payload' which is the text of the line without line-ending.

    The 'payload' value is the same as is returned from the linepayload method.

    class::textinfo lineinfolist startidx endidx

    Returns list of lineinfo dicts for each line in line index range startidx to endidx

    class::textinfo linerange_to_chunkrange startidx endidx
    class::textinfo linerange_to_chunk startidx endidx
    class::textinfo lines startidx endidx
    class::textinfo linepayloads startidx endidx
    class::textinfo chunkrange_to_linerange chunkstart chunkend
    class::textinfo chunkrange_to_lineinfolist chunkstart chunkend ?option value...?

    Return a list of dicts each with structure like the result of the lineinfo method - but possibly with extra keys for truncation information if -show_truncated 1 is supplied

    The truncation key in a lineinfo dict may be returned for first and/or last line in the resulting list.

    truncation shows the shortened (missing bytes on left and/or right side) part of the entire line (potentially including line-ending or even partial line-ending)

    Note that this truncation info is only in the return value of this method - and will not be reflected in lineinfo queries to the main chunk.

    class::textinfo numeric_linerange startidx endidx

    A helper that converts any Tcl-style end or end-x values given for startidx or endidx to their specific numeric values, based on the current state of the underlying line data

    This is used internally by API functions such as line to enable it to accept more expressive indices

    class::textinfo numeric_chunkrange startidx endidx

    A helper that converts any Tcl-style end or end-x values given for startidx or endidx to their specific numeric values, based on the current state of the underlying chunk data

    class::textinfo normalize_indices startidx endidx max

    A utility to convert some of the Tcl-style list-index expressions such as end, end-1 etc to valid indices in the range 0 to the supplied max

    Basic addition and subtraction expressions such as 4-1 5+2 are accepted

    startidx higher than endidx is allowed

    Unlike Tcl's index expressions - we raise an error if the calculated index is out of bounds 0 to max
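
    The rules above can be approximated with a short standalone sketch. The procs normalize_index_sketch and normalize_indices_sketch are hypothetical; the exact set of expressions accepted by the real method and the wording of its errors are assumptions here.

```tcl
# Hypothetical sketch: resolve one index expression (N, end, N+M, N-M,
# end-M, end+M) to an integer and raise an error if it falls outside 0..max.
proc normalize_index_sketch {idx max} {
    if {![regexp {^(end|\d+)(?:([+-])(\d+))?$} $idx -> base op operand]} {
        error "unsupported index expression '$idx'"
    }
    # scan %d avoids octal interpretation of values with leading zeros
    set value [expr {$base eq "end" ? $max : [scan $base %d]}]
    if {$op eq "+"} {
        incr value [scan $operand %d]
    } elseif {$op eq "-"} {
        incr value -[scan $operand %d]
    }
    if {$value < 0 || $value > $max} {
        error "index '$idx' resolves to $value, outside 0..$max"
    }
    return $value
}

# Resolve a start/end pair; start higher than end is allowed.
proc normalize_indices_sketch {startidx endidx max} {
    return [list [normalize_index_sketch $startidx $max] \
                 [normalize_index_sketch $endidx $max]]
}

puts [normalize_indices_sketch end-1 4-1 9]   ;# 8 3
```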

    class::textinfo regenerate_lines

    generate a list of lines from the current state of the stored raw data chunk and keep a map of line-endings indexed by lineindex

    This is called automatically by the Constructor during object creation

    It is exposed in the API experimentally - as chunk and line manipulation functions are considered.

    TODO - review whether such manual control will be necessary/desirable

Namespace punk::fileline

Core API functions for punk::fileline

get_textinfo ?option value...? ?datachunk?

Returns a textinfo object instance representing the data in the string datachunk, or - if -file filename is supplied - the data loaded from that file

The encoding used is as specified in the -encoding option - or from the Byte Order Mark (bom) at the beginning of the data

For Tcl 8.6 - encodings such as utf-16le may not be available - so the bytes are swapped appropriately depending on the platform byteOrder and encoding 'unicode' is used.
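
The byte-swapping approach can be sketched as follows. Both proc names are hypothetical and this is not the module's code; the decoding step assumes Tcl 8.6, where the 'unicode' encoding uses the platform's native byte order.

```tcl
# Hypothetical sketch: swap each pair of bytes in a binary string.
# A trailing odd byte, if any, is dropped.
proc swap16_sketch {bytes} {
    set units {}
    # read big-endian 16-bit units, write them back little-endian
    binary scan $bytes S* units
    return [binary format s* $units]
}

# Decode utf-16be bytes via the native-order 'unicode' encoding (Tcl 8.6 assumed).
proc utf16be_to_string_sketch {bytes} {
    if {$::tcl_platform(byteOrder) eq "littleEndian"} {
        set bytes [swap16_sketch $bytes]
    }
    return [encoding convertfrom unicode $bytes]
}
```

Because swapping is symmetric, the same swap16_sketch proc would serve for utf-16le data on a big-endian platform.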

The encoding defaults to utf-8 if no -encoding is specified and no BOM is found

Specify -encoding binary to perform no encoding conversion

Whether -encoding was specified or not - by default the BOM characters are not retained in the line-data

If -includebom 1 is specified - the bom will be retained in the stored chunk and in the data for the first line (lineindex 0), but will undergo the same encoding transformation as the rest of the data

The get_bomid method of the returned object returns an identifier for any BOM encountered.

e.g. utf-8, utf-16be, utf-16le, utf-32be, utf-32le, SCSU, BOCU-1, GB18030, UTF-EBCDIC, utf-1, utf-7

If the encoding specified in the BOM isn't recognised by Tcl - the resulting data is likely to remain as the raw bytes (binary translation)

Currently only utf-8, utf-16* and utf-32* are properly supported even though the other BOMs are detected, reported via get_bomid, and stripped from the data.

GB18030 falls back to cp936/gbk (unless a gb18030 encoding has been installed). Use -encoding binary if this isn't suitable and you need to do your own processing of the raw data.
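
As an illustration of BOM detection for the properly supported encodings, the following standalone sketch inspects the leading bytes of binary data. The proc bomid_sketch is hypothetical and is not how get_textinfo is implemented; returning an empty string when no BOM is found is also an assumption.

```tcl
# Hypothetical sketch: identify a utf-8/utf-16/utf-32 BOM at the start of
# binary data. Returns an identifier, or an empty string if none is found.
proc bomid_sketch {bytes} {
    # Longest signatures first: the utf-32le BOM (FF FE 00 00) begins with
    # the utf-16le BOM (FF FE), so it must be tested before it.
    foreach {id hex} {
        utf-32be 0000feff
        utf-32le fffe0000
        utf-8    efbbbf
        utf-16be feff
        utf-16le fffe
    } {
        set bom [binary format H* $hex]
        if {[string range $bytes 0 [expr {[string length $bom] - 1}]] eq $bom} {
            return $id
        }
    }
    return ""
}

set data [binary format H* efbbbf6869]   ;# utf-8 BOM followed by "hi"
puts [bomid_sketch $data]                ;# utf-8
```

Note that utf-16le data whose first character after the BOM is U+0000 is indistinguishable from a utf-32le BOM with this approach.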

Namespace punk::fileline::lib

Secondary functions that are part of the API

lib::range_spans_chunk_boundaries start end chunksize

Takes start and end offset, generally representing bytes or character indices, and computes a list of boundaries at multiples of the chunksize that are spanned by the start and end range.

integer start

zero-based start index of range

integer end

zero-based end index of range

integer chunksize

Number of bytes/characters in chunk - must be greater than 0

returns a dict with the keys is_span and boundaries

is_span 0|1 indicates if the range specified spans a boundary of chunksize

boundaries contains a list of the spanned boundaries - which are always multiples of the chunksize

e.g

    range_spans_chunk_boundaries 10 1750 512
    is_span 1 boundaries {512 1024 1536}

The -offset <int> option shifts the boundaries by the given amount, e.g

    range_spans_chunk_boundaries 10 1750 512 -offset 2
    is_span 1 boundaries {514 1026 1538}

This function automatically uses lseq (if Tcl >= 8.7) when the number of boundaries spanned is greater than approximately 75
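
The boundary calculation itself can be sketched in a few lines of plain Tcl. The proc spans_sketch is hypothetical and takes the offset as an optional positional argument rather than a -offset option; treating a boundary that coincides exactly with start or end as spanned is an assumption, and the lseq optimisation is not modelled.

```tcl
# Hypothetical sketch: list the multiples of chunksize (shifted by offset)
# that fall within start..end, returned in the same dict shape as above.
proc spans_sketch {start end chunksize {offset 0}} {
    if {$chunksize <= 0} {
        error "chunksize must be greater than 0"
    }
    # smallest k such that k*chunksize + offset >= start
    # (Tcl integer division rounds towards negative infinity)
    set k [expr {($start - $offset + $chunksize - 1) / $chunksize}]
    set boundaries {}
    for {set b [expr {$k * $chunksize + $offset}]} {$b <= $end} {incr b $chunksize} {
        lappend boundaries $b
    }
    return [dict create is_span [expr {[llength $boundaries] > 0}] boundaries $boundaries]
}

puts [spans_sketch 10 1750 512]     ;# is_span 1 boundaries {512 1024 1536}
puts [spans_sketch 10 1750 512 2]   ;# is_span 1 boundaries {514 1026 1538}
```

Both calls reproduce the example results shown for range_spans_chunk_boundaries above.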

Internal

Namespace punk::fileline::system

Internal functions that are not part of the API

Namespace punk::fileline::ansi

These are ansi functions imported from punk::ansi - or no-ops if that package is unavailable

See punk::ansi for documentation

ansi::a
ansi::a+
ansi::stripansi

Keywords

BOM, encoding, file, module, parse, text