Class BCF2Utils


  • public final class BCF2Utils
    extends Object
    Common utilities for working with BCF2 files. Includes convenience methods for encoding and decoding BCF2 type descriptors (size + type).
    Since:
    5/12
    • Field Detail

      • MAX_ALLELES_IN_GENOTYPES

        public static final int MAX_ALLELES_IN_GENOTYPES
        See Also:
        Constant Field Values
      • OVERFLOW_ELEMENT_MARKER

        public static final int OVERFLOW_ELEMENT_MARKER
        See Also:
        Constant Field Values
      • INTEGER_TYPES_BY_SIZE

        public static final BCF2Type[] INTEGER_TYPES_BY_SIZE
      • ID_TO_ENUM

        public static final BCF2Type[] ID_TO_ENUM
    • Method Detail

      • makeDictionary

        public static ArrayList<String> makeDictionary​(VCFHeader header)
        Create a strings dictionary from the VCF header. The dictionary is an ordered list of the common VCF identifiers (the IDs of FILTER, INFO, and FORMAT fields). It is critical that the list be deduplicated and ordered in a consistent manner each time: BCF2 offsets are encoded relative to this dictionary, so a dictionary that is not derived from the header in exactly the same way each time makes those offsets point at the wrong entries.
        Parameters:
        header - the VCFHeader from which to build the dictionary
        Returns:
        a non-null dictionary of elements, may be empty
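The deduplication requirement can be illustrated with a minimal sketch. It assumes the header IDs have already been gathered into a list in header order; `DictionarySketch` and its list-based signature are hypothetical and do not use `VCFHeader`, and first-seen order is only one possible consistent ordering rule.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical stand-in for makeDictionary: works on plain ID strings
// rather than on a VCFHeader.
final class DictionarySketch {
    // Drops duplicate IDs while keeping first-seen order, so the same
    // input always yields the same offsets.
    static ArrayList<String> makeDictionary(List<String> idsInHeaderOrder) {
        return new ArrayList<>(new LinkedHashSet<>(idsInHeaderOrder));
    }

    public static void main(String[] args) {
        // "DP" appears twice (for example as both an INFO and a FORMAT ID).
        System.out.println(makeDictionary(List.of("q10", "DP", "AF", "DP")));
    }
}
```

Because `LinkedHashSet` preserves insertion order, calling the method twice on the same list returns identical dictionaries, which is the property the offsets depend on.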
      • encodeTypeDescriptor

        public static byte encodeTypeDescriptor​(int nElements,
                                                BCF2Type type)
      • decodeSize

        public static int decodeSize​(byte typeDescriptor)
      • decodeTypeID

        public static int decodeTypeID​(byte typeDescriptor)
      • decodeType

        public static BCF2Type decodeType​(byte typeDescriptor)
      • sizeIsOverflow

        public static boolean sizeIsOverflow​(byte typeDescriptor)
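The four descriptor methods above carry no description, so the following is a sketch of how a "size + type" byte can be packed. It assumes the BCF2 layout in which the high four bits hold the element count and the low four bits hold the type ID, with a count of 15 acting as the overflow marker. `TypeDescriptorSketch` is a hypothetical class that takes a raw integer type ID instead of a `BCF2Type`.

```java
// Hypothetical sketch of the type descriptor byte; not the library source.
final class TypeDescriptorSketch {
    // Assumed value of OVERFLOW_ELEMENT_MARKER: the largest 4-bit count.
    static final int OVERFLOW_MARKER = 15;

    // Packs the count into the high nibble and the type ID into the low one.
    // Counts of 15 or more are clamped to the overflow marker.
    static byte encode(int nElements, int typeId) {
        int size = Math.min(nElements, OVERFLOW_MARKER);
        return (byte) (((size << 4) & 0xF0) | (typeId & 0x0F));
    }

    static int decodeSize(byte descriptor) {
        return (descriptor & 0xF0) >> 4;
    }

    static int decodeTypeId(byte descriptor) {
        return descriptor & 0x0F;
    }

    // True when the real element count has to be read from elsewhere.
    static boolean sizeIsOverflow(byte descriptor) {
        return decodeSize(descriptor) == OVERFLOW_MARKER;
    }

    public static void main(String[] args) {
        byte d = encode(3, 1);
        System.out.println(decodeSize(d) + " elements of type ID " + decodeTypeId(d));
    }
}
```

Masking with `0xF0` before shifting matters because a Java `byte` is signed; without the mask, descriptors with the top bit set would decode to negative sizes.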
      • collapseStringList

        public static String collapseStringList​(List<String> strings)
        Collapse multiple strings into a single comma-separated string with a leading comma: ["s1", "s2", "s3"] => ",s1,s2,s3"
        Parameters:
        strings - size > 1 list of strings
        Returns:
      • explodeStringList

        public static List<String> explodeStringList​(String collapsed)
        Inverse operation of collapseStringList. ",s1,s2,s3" => ["s1", "s2", "s3"]
        Parameters:
        collapsed -
        Returns:
      • isCollapsedString

        public static boolean isCollapsedString​(String s)
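The round trip between the list form and the collapsed form can be sketched as follows. `CollapsedStringSketch` is a hypothetical re-implementation based only on the examples above; it assumes a collapsed string is recognised by its leading comma and that no element itself contains a comma.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the collapse/explode pair; not the library source.
final class CollapsedStringSketch {
    // Prefixes every element with a comma: [s1, s2] becomes ",s1,s2".
    static String collapse(List<String> strings) {
        StringBuilder sb = new StringBuilder();
        for (String s : strings) {
            sb.append(',').append(s);
        }
        return sb.toString();
    }

    // Assumption: a leading comma is what marks a collapsed list.
    static boolean isCollapsed(String s) {
        return !s.isEmpty() && s.charAt(0) == ',';
    }

    // Drops the leading comma and splits on the remaining ones.
    // Note that String.split discards trailing empty elements.
    static List<String> explode(String collapsed) {
        if (!isCollapsed(collapsed)) {
            throw new IllegalArgumentException("not a collapsed string: " + collapsed);
        }
        return new ArrayList<>(Arrays.asList(collapsed.substring(1).split(",")));
    }

    public static void main(String[] args) {
        String collapsed = collapse(List.of("s1", "s2", "s3"));
        System.out.println(collapsed + " <=> " + explode(collapsed));
    }
}
```

The leading comma is what lets a reader tell a collapsed list apart from an ordinary single string value.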
      • shadowBCF

        public static final File shadowBCF​(File vcfFile)
        Returns a good name for a shadow BCF file for vcfFile: foo.vcf => foo.bcf; foo.xxx => foo.xxx.bcf. If the resulting BCF file cannot be written, returns null. This happens, for example, when vcfFile is /dev/null.
        Parameters:
        vcfFile -
        Returns:
        the BCF
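The naming rule alone can be sketched as below. `ShadowNameSketch` is hypothetical and covers only the two renaming cases given above; it deliberately omits the writability check that makes the real method return null.

```java
import java.io.File;

// Hypothetical sketch of the shadow-file naming rule only.
final class ShadowNameSketch {
    private static final String VCF = ".vcf";
    private static final String BCF = ".bcf";

    // foo.vcf => foo.bcf; any other name gets ".bcf" appended.
    static File shadowName(File vcfFile) {
        String path = vcfFile.getPath();
        if (path.endsWith(VCF)) {
            return new File(path.substring(0, path.length() - VCF.length()) + BCF);
        }
        return new File(path + BCF);
    }

    public static void main(String[] args) {
        System.out.println(shadowName(new File("foo.vcf")));
        System.out.println(shadowName(new File("foo.xxx")));
    }
}
```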
      • determineIntegerType

        public static BCF2Type determineIntegerType​(int value)
      • determineIntegerType

        public static BCF2Type determineIntegerType​(int[] values)
      • maxIntegerType

        public static BCF2Type maxIntegerType​(BCF2Type t1,
                                              BCF2Type t2)
        Returns the larger of the two BCF2 integer types t1 and t2. For example, if t1 == INT8 and t2 == INT16, returns INT16.
        Parameters:
        t1 -
        t2 -
        Returns:
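The three integer-type helpers can be illustrated together. This sketch uses a hypothetical local enum rather than `BCF2Type`, and the value ranges are an assumption (symmetric bounds that leave the lowest values of each width free for sentinel use); the page itself does not state the bounds.

```java
// Hypothetical sketch of smallest-fitting-type selection; not the library source.
final class IntegerTypeSketch {
    // Declared from narrowest to widest so ordinal() reflects size.
    enum IntType { INT8, INT16, INT32 }

    // Assumed bounds: the most negative values of each width are reserved.
    static IntType determine(int value) {
        if (value >= -127 && value <= 127) return IntType.INT8;
        if (value >= -32767 && value <= 32767) return IntType.INT16;
        return IntType.INT32;
    }

    // Returns whichever of the two types is wider.
    static IntType max(IntType t1, IntType t2) {
        return t1.ordinal() >= t2.ordinal() ? t1 : t2;
    }

    // The type for an array is the widest type needed by any element.
    static IntType determine(int[] values) {
        IntType widest = IntType.INT8;
        for (int v : values) {
            widest = max(widest, determine(v));
            if (widest == IntType.INT32) break; // cannot get any wider
        }
        return widest;
    }

    public static void main(String[] args) {
        System.out.println(determine(new int[] {1, 300, 5}));
    }
}
```

Choosing the narrowest type that fits every element keeps the encoded vector as small as possible while still using a single type for the whole array.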
      • toList

        public static <T> List<T> toList​(Class<T> c,
                                         Object o)
        Helper function that takes an object and returns a list representation of it: o == null => []; o is a list => o; otherwise => [o].
        Parameters:
        c - the class of the object
        o - the object to convert to a Java List
        Returns:
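The three cases can be written out directly. `ToListSketch` is a hypothetical re-implementation of the rule as described; using `Class.cast` for the single-element case is a choice made here, not something the page specifies.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the null / list / single-object rule.
final class ToListSketch {
    @SuppressWarnings("unchecked")
    static <T> List<T> toList(Class<T> c, Object o) {
        if (o == null) {
            return Collections.emptyList();           // null => []
        }
        if (o instanceof List) {
            return (List<T>) o;                       // list => returned as-is, elements unchecked
        }
        return Collections.singletonList(c.cast(o));  // anything else => [o]
    }

    public static void main(String[] args) {
        System.out.println(toList(String.class, null));
        System.out.println(toList(String.class, "a"));
    }
}
```

In this sketch a list argument is returned without copying, so callers share the same instance, and a non-list argument of the wrong class fails fast with a `ClassCastException`.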
      • headerLinesAreOrderedConsistently

        public static boolean headerLinesAreOrderedConsistently​(VCFHeader outputHeader,
                                                                VCFHeader genotypesBlockHeader)
        Are the elements and their order in the output and input headers consistent, so that we can write out the raw genotypes block without decoding and recoding it? If the order of INFO, FILTER, or contig elements in the output header differs from that in the input header, we must decode the blocks using the input header and then recode them based on the new output order. If they are consistent, we can simply pass through the raw genotypes block bytes, which is a *huge* performance win for large blocks.
        Many common operations on BCF2 files (merging them for -nt, selecting a subset of records, etc.) don't modify the ordering of the header fields and so can safely pass through the genotypes undecoded. Some operations, such as those that add filters or info fields, can change the ordering of the header fields and so produce invalid BCF2 files if the genotypes aren't decoded.
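One way to picture the consistency check is as a prefix comparison over ordered header IDs. The sketch below assumes that "consistent" means every input ID sits at the same position in the output header, so existing offsets stay valid even if the output appends new lines. `HeaderOrderSketch` and its list-based signature are hypothetical; the real method takes two `VCFHeader` objects and may check more than this.

```java
import java.util.List;

// Hypothetical sketch: headers are reduced to their ordered ID lists.
final class HeaderOrderSketch {
    // True when the input IDs form a prefix of the output IDs, meaning
    // offsets encoded against the input header still resolve correctly.
    static boolean orderedConsistently(List<String> outputIds, List<String> inputIds) {
        if (inputIds.size() > outputIds.size()) {
            return false; // the output is missing lines the block refers to
        }
        return outputIds.subList(0, inputIds.size()).equals(inputIds);
    }

    public static void main(String[] args) {
        List<String> input = List.of("DP", "AF");
        System.out.println(orderedConsistently(List.of("DP", "AF", "GQ"), input));
        System.out.println(orderedConsistently(List.of("AF", "DP"), input));
    }
}
```

When the check passes, the raw genotype bytes can be copied through unchanged; when it fails, the block has to be decoded against the input header and re-encoded against the output header.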