punkshell_module_punk::fileline - file line-handling utilities
Utilities for in-memory analysis of text file data as both line data and byte/char-counted data, whilst preserving the line-endings (even if mixed).
This matters for text files whose processing depends on exact char/byte counts.
For example - windows .cmd/.bat files need some byte counting to determine if labels lie on chunk boundaries and need to be moved.
This chunk-size counting will depend on the character encoding.
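As a small illustration of why the encoding matters, the following plain Tcl snippet (it does not use punk::fileline) compares the character count of a string with its byte count under two encodings. The sample text is an arbitrary choice for the example.

```tcl
# "caf\u00e9" is four characters; its byte length depends on the encoding.
set text "caf\u00e9"
set charcount   [string length $text]
set utf8bytes   [string length [encoding convertto utf-8 $text]]
set latin1bytes [string length [encoding convertto iso8859-1 $text]]
puts "chars=$charcount utf-8 bytes=$utf8bytes iso8859-1 bytes=$latin1bytes"
# prints: chars=4 utf-8 bytes=5 iso8859-1 bytes=4
```

A boundary placed at a fixed byte offset can therefore fall at a different character position depending on how the text is encoded.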
Despite including the word 'file', the library doesn't necessarily deal with reading/writing to the filesystem.
The raw data can be supplied as a string, or loaded from a file using punk::fileline::get_textinfo -file <filename>
A chunk of textfile data (possibly representing a whole file - but usually at least a complete set of lines) is loaded into a punk::fileline::class::textinfo instance at object creation.
    package require punk::fileline
    package require fileutil
    # fileutil::cat options must precede the filename they apply to
    set rawdata [fileutil::cat -translation binary data.txt]
    punk::fileline::class::textinfo create obj_data $rawdata
    puts stdout [obj_data linecount]
Line records are referred to by a zero-based index instead of a one-based index as is commonly used when displaying files.
This is for programming consistency and convenience, and the module user should do their own conversion to one-based indexing for line display or messaging if desired.
Lone carriage-returns are not supported as line-endings.
CR line-endings that are intended to be interpreted as such should be mapped to something else before the data is supplied to this module.
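One possible way to do that mapping before handing data to the module is sketched below in plain Tcl. The proc name map_lone_cr is hypothetical and not part of punk::fileline, and replacing with a linefeed is only one choice of "something else".

```tcl
# Hypothetical helper: replace each carriage-return that is NOT part of a
# CRLF pair with a replacement string (a linefeed by default).
proc map_lone_cr {data {replacement "\n"}} {
    # \r(?!\n) matches a carriage-return not followed by a linefeed
    return [regsub -all {\r(?!\n)} $data $replacement]
}

set mapped [map_lone_cr "a\rb\r\nc\r"]
# CRLF pairs are left untouched; only the lone CRs are changed
```

Note that the replacement is used as a regsub substitution spec, so a replacement containing & or a backslash would need escaping.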
packages that add functionality but aren't strictly required
punk::ansi
- recommended for class::textinfo chunk_boundary_display
punk::char
- recommended for class::textinfo chunk_boundary_display
overtype
- recommended for class::textinfo chunk_boundary_display
class definitions
CLASS textinfo
METHODS
Constructor for textinfo object which represents a chunk or all of a file
datachunk should be passed with the file data including line-endings as-is for full functionality, i.e. when loading the data use something like:
    fconfigure $fd -translation binary
    set chunkdata [read $fd]
or
    set chunkdata [fileutil::cat -translation binary <filename>]
Return a range of bytes from the underlying raw chunk data.
e.g The following retrieves the entire chunk
objName chunk 0 end
Number of bytes/characters in the raw data of the file
Returns a string displaying the boundaries at chunksize bytes between chunkstart and chunkend
Defaults to using ansi colour if punk::ansi module is available. Use -ansi 0 to disable colour
Number of lines in the raw data of the file, counted as per the policy in effect
Reconstructs and returns the raw line using the payload and per-line stored line-ending metadata
A 'line' may be returned without a line-ending if the underlying chunk had trailing data without a line-ending (or the chunk was loaded under a non-standard -policy setting)
Whilst such data may not conform to definitions (e.g POSIX) of the terms 'textfile' and 'line' - it is useful here to represent it as a line with metadata le set to "none"
To return just the data which might more commonly be needed for dealing with lines, use the linepayload method - which returns the line data minus line-ending
Return a lineinfolist (see lineinfo and lineinfolist) of lines where payload matches the globsearch string
To limit the returned results use the -limit n option - where -limit 0 means return all matches.
For example: linepayload_find_glob "*test*" -limit 1
The result is always a list of lineinfo dictionaries even if one item is returned
-limitfrom can be start|end
The order of results is always the order as they occur in the data - even if -limitfrom end is specified.
-limitfrom end means that only the last -limit items are returned
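The -limit and -limitfrom behaviour described above can be sketched in plain Tcl as follows. This is an illustration of the documented semantics only, not the module's implementation, and the proc name limit_matches is hypothetical.

```tcl
# Hypothetical sketch: trim a list of matches (already in data order)
# to at most $limit items, taken from the start or the end.
proc limit_matches {matches limit {limitfrom start}} {
    if {$limit == 0} {
        return $matches   ;# 0 means no limit
    }
    switch -- $limitfrom {
        start   { return [lrange $matches 0 [expr {$limit - 1}]] }
        end     { return [lrange $matches end-[expr {$limit - 1}] end] }
        default { error "limitfrom must be start or end" }
    }
}

puts [limit_matches {a b c d} 2 start]   ;# a b
puts [limit_matches {a b c d} 2 end]     ;# c d (still in data order)
```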
Note that as glob accepts [chars] to mean match any character in the set given by chars, searching for literal square brackets should be done by escaping the bracket with a backslash
This is true even if only a single square bracket is being searched for. e.g {*[file*} will not find the word file followed by a left square-bracket - even though the search didn't close the square brackets.
In the above case - the literal search should be {*\[file*}
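The escaping rule can be tried out with Tcl's own string match, which uses the same glob syntax. The helper glob_escape below is hypothetical (not part of punk::fileline); it shows one way to build a literal search pattern from arbitrary text.

```tcl
# Hypothetical helper: backslash-escape glob metacharacters in literal text.
proc glob_escape {text} {
    return [string map [list \\ \\\\ * \\* ? \\? \[ \\\[ \] \\\]] $text]
}

set payload   "see \[file join a b\] for details"
set unescaped {*[file*}
set escaped   "*[glob_escape {[file}]*"   ;# same as {*\[file*}

puts [string match $unescaped $payload]   ;# 0: "[file*" starts an unterminated character set
puts [string match $escaped $payload]     ;# 1: "\[" matches a literal left square-bracket
```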
Return the text of the line indicated by the zero-based lineindex
The line-ending is not returned in the data - but is still stored against this lineindex
Line Metadata such as the line-ending for a particular line and the byte/character range it occupies within the chunk can be retrieved with the linemeta method
To retrieve both the line text and metadata in a single call the lineinfo method can be used
To retrieve an entire line including line-ending use the line method.
Return a list of just the payloads in the specified lineindex range, with no metadata.
Return a dict of the metadata for the line indicated by the zero-based lineindex
Keys returned include
le
A string representing the type of line-ending: crlf|lf|none
linelen
The number of characters/bytes in the whole line including line-ending if any
payloadlen
The number of characters/bytes in the line excluding line-ending
start
The zero-based index into the associated raw file data indicating at which byte/character index this line begins
end
The zero-based index into the associated raw file data indicating at which byte/character index this line ends
This end-point corresponds to the last character of the line-ending if any - not necessarily the last character of the line's payload
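To make the relationship between these keys concrete, here is a plain Tcl sketch that derives the same kind of metadata from a raw string. It is not the module's implementation: the proc name sketch_linemeta is hypothetical, and it assumes that data ending in a line-ending does not produce an extra empty final line, which may differ from the module's policy.

```tcl
# Hypothetical sketch: compute le/linelen/payloadlen/start/end per line.
proc sketch_linemeta {data} {
    set result [list]
    set pos 0
    set len [string length $data]
    while {$pos < $len} {
        set nl [string first "\n" $data $pos]
        if {$nl < 0} {
            # trailing data with no line-ending
            set le none
            set end [expr {$len - 1}]
            set payloadlen [expr {$len - $pos}]
        } elseif {$nl > $pos && [string index $data [expr {$nl - 1}]] eq "\r"} {
            set le crlf
            set end $nl
            set payloadlen [expr {$nl - 1 - $pos}]
        } else {
            set le lf
            set end $nl
            set payloadlen [expr {$nl - $pos}]
        }
        lappend result [dict create le $le linelen [expr {$end - $pos + 1}] \
            payloadlen $payloadlen start $pos end $end]
        set pos [expr {$end + 1}]
    }
    return $result
}

set metas [sketch_linemeta "ab\r\ncd\nef"]
foreach meta $metas {
    puts $meta
}
```

For the sample string the first line has le crlf with start 0 and end 3, so end points at the linefeed of the CRLF pair rather than at the last payload character.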
Return a dict of the metadata and text for the line indicated by the zero-based lineindex
This returns the same info as the linemeta with an added key of 'payload' which is the text of the line without line-ending.
The 'payload' value is the same as is returned from the linepayload method.
Returns list of lineinfo dicts for each line in line index range startidx to endidx
Return a list of dicts each with structure like the result of the lineinfo method - but possibly with extra keys for truncation information if -show_truncated 1 is supplied
The truncation key in a lineinfo dict may be returned for first and/or last line in the resulting list.
truncation shows the shortened (missing bytes on left and/or right side) part of the entire line (potentially including line-ending or even partial line-ending)
Note that this truncation info is only in the return value of this method - and will not be reflected in lineinfo queries to the main chunk.
A helper to return any Tcl-style end or end-x values given to startidx or endidx, converted to their specific values based on the current state of the underlying line data
This is used internally by API functions such as line to enable it to accept more expressive indices
A helper to return any Tcl-style end or end-x entries supplied to startidx or endidx, converted to their specific values based on the current state of the underlying chunk data
A utility to convert some of the Tcl-style list-index expressions such as end, end-1 etc to valid indices in the range 0 to the supplied max
Basic addition and subtraction expressions such as 4-1 5+2 are accepted
startidx higher than endidx is allowed
Unlike Tcl's index expressions - we raise an error if the calculated index is out of bounds 0 to max
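The conversion described above might look roughly like the following plain Tcl sketch. The proc name sketch_normalize_index is hypothetical, the set of accepted expressions is narrower than whatever the module actually supports, and it assumes that end resolves to the supplied max.

```tcl
# Hypothetical sketch: resolve "end", "end-N", "N", "N+M", "N-M" against max,
# raising an error when the result falls outside 0..max.
proc sketch_normalize_index {index max} {
    if {![regexp {^(end|\d+)(?:([+-])(\d+))?$} $index -> base op operand]} {
        error "unsupported index expression '$index'"
    }
    if {$base eq "end"} {
        set value $max
    } else {
        scan $base %d value   ;# scan avoids octal interpretation of leading zeros
    }
    if {$op ne ""} {
        scan $operand %d operand
        set value [expr {$op eq "+" ? $value + $operand : $value - $operand}]
    }
    if {$value < 0 || $value > $max} {
        error "index '$index' resolves to $value which is outside 0 to $max"
    }
    return $value
}

puts [sketch_normalize_index end-1 9]   ;# 8
puts [sketch_normalize_index 5+2 9]     ;# 7
```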
generate a list of lines from the current state of the stored raw data chunk and keep a map of line-endings indexed by lineindex
This is called automatically by the Constructor during object creation
It is exposed in the API experimentally - as chunk and line manipulation functions are considered.
TODO - review whether such manual control will be necessary/desirable
Core API functions for punk::fileline
Returns textinfo object instance representing data in string datachunk or if -file filename supplied - data loaded from a file
The encoding used is as specified in the -encoding option - or from the Byte Order Mark (bom) at the beginning of the data
For Tcl 8.6 - encodings such as utf-16le may not be available - so the bytes are swapped appropriately depending on the platform byteOrder and encoding 'unicode' is used.
encoding defaults to utf-8 if no -encoding specified and no BOM was found
Specify -encoding binary to perform no encoding conversion
Whether -encoding was specified or not - by default the BOM characters are not retained in the line-data
If -includebom 1 is specified - the bom will be retained in the stored chunk and in the data for the first line (lineindex 0), but will undergo the same encoding transformation as the rest of the data
The get_bomid method of the returned object will contain an identifier for any BOM encountered.
e.g. utf-8, utf-16be, utf-16le, utf-32be, utf-32le, SCSU, BOCU-1, GB18030, UTF-EBCDIC, utf-1, utf-7
If the encoding specified in the BOM isn't recognised by Tcl - the resulting data is likely to remain as the raw bytes (binary translation)
Currently only utf-8, utf-16* and utf-32* are properly supported even though the other BOMs are detected, reported via get_bomid, and stripped from the data.
GB18030 falls back to cp936/gbk (unless a gb18030 encoding has been installed). Use -encoding binary if this isn't suitable and you need to do your own processing of the raw data.
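For orientation, BOM detection for the properly supported encodings can be sketched in plain Tcl as below. This is not the module's code: sketch_bomid is a hypothetical name, only the utf-8, utf-16 and utf-32 BOMs are covered, and returning an empty string when no BOM is found is an assumption of the sketch.

```tcl
# Hypothetical sketch: identify a Unicode BOM at the start of raw bytes.
proc sketch_bomid {rawbytes} {
    set hex ""
    binary scan [string range $rawbytes 0 3] H* hex
    # 4-byte BOMs are tested first because the utf-32le BOM (FF FE 00 00)
    # begins with the utf-16le BOM (FF FE)
    foreach {prefix id} {
        0000feff utf-32be
        fffe0000 utf-32le
        efbbbf   utf-8
        feff     utf-16be
        fffe     utf-16le
    } {
        if {[string match ${prefix}* $hex]} {
            return $id
        }
    }
    return ""
}

puts [sketch_bomid [binary format H* efbbbf41]]   ;# utf-8
puts [sketch_bomid [binary format H* fffe4100]]   ;# utf-16le
```

The data must have been read with -translation binary (or an equivalent) for the byte comparison to be meaningful.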
Secondary functions that are part of the API
Takes start and end offset, generally representing bytes or character indices, and computes a list of boundaries at multiples of the chunksize that are spanned by the start and end range.
zero-based start index of range
zero-based end index of range
Number of bytes/characters in a chunk - must be an integer greater than 0
returns a dict with the keys is_span and boundaries
is_span 0|1 indicates if the range specified spans a boundary of chunksize
boundaries contains a list of the spanned boundaries - which are always multiples of the chunksize
e.g.
    range_spans_chunk_boundaries 10 1750 512
    is_span 1 boundaries {512 1024 1536}
The -offset <int> option shifts each reported boundary by the given amount, e.g.
    range_spans_chunk_boundaries 10 1750 512 -offset 2
    is_span 1 boundaries {514 1026 1538}
This function automatically uses lseq (if Tcl >= 8.7) when number of boundaries spanned is approximately greater than 75
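The boundary calculation can be approximated with the plain Tcl sketch below, which reproduces the two documented results. It is not the module's implementation: the proc name sketch_span_boundaries is hypothetical, the offset is taken as a positional argument rather than a -offset option, and the edge handling (a boundary counts when it is greater than start and no greater than end) is an assumption.

```tcl
# Hypothetical sketch: list the chunk boundaries spanned by start..end.
proc sketch_span_boundaries {start end chunksize {offset 0}} {
    if {$chunksize <= 0} {
        error "chunksize must be an integer greater than 0"
    }
    set boundaries [list]
    # smallest k such that k*chunksize+offset is strictly greater than start
    set k [expr {($start - $offset) / $chunksize + 1}]
    for {set b [expr {$k * $chunksize + $offset}]} {$b <= $end} {incr b $chunksize} {
        lappend boundaries $b
    }
    return [dict create is_span [expr {[llength $boundaries] > 0}] boundaries $boundaries]
}

puts [sketch_span_boundaries 10 1750 512]     ;# is_span 1 boundaries {512 1024 1536}
puts [sketch_span_boundaries 10 1750 512 2]   ;# is_span 1 boundaries {514 1026 1538}
```

A simple loop is adequate for a sketch; generating the list in one step (as the real function does with lseq on newer Tcl versions) only matters when many boundaries are spanned.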
Internal functions that are not part of the API
These are ansi functions imported from punk::ansi - or no-ops if that package is unavailable
See punk::ansi for documentation
Copyright © 2024