Deno: read big file line by line

Created on 15 Feb 2020 · 3 Comments · Source: denoland/deno

I am trying to read a big file (13,147,026 lines of text) line by line with Deno, but it fails with this error:

BufferFullError: Buffer full
    at BufReader.readSlice (bufio.ts:339:15)
    at async BufReader.readString (bufio.ts:217:20)
    at async stream_file (buffer.ts:9:18)

Here is my code:

import { BufReader } from "https://deno.land/std/io/bufio.ts";

export async function stream_file(filename: string) {
  const file = await Deno.open(filename);
  const bufReader = new BufReader(file);
  console.log("Reading data...");
  let line: string | any;
  let lineCount: number = 0;
  while ((line = await bufReader.readString("\n")) != Deno.EOF) {
    lineCount++;
    // do something with `line`.
  }
  file.close();
  console.log(`${lineCount} lines read.`);
}

versions:

deno 0.33.0
v8 8.1.108
typescript 3.7.2

I am trying to find an equivalent of Node streams.
Please advise!

All 3 comments

In Deno, the I/O model is different. The Reader interface is what you want. The idea of a reader is to give the caller a way to request data while getting backpressure (which you can detect by checking whether the Reader gave you the number of bytes you expected). If you aren't familiar with Go, this might seem a bit odd.
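To make the pull model concrete, here is a minimal sketch of a caller draining a reader through a fixed-size buffer. The `Reader` interface below is a hypothetical stand-in shaped like Deno's (it uses `null` for end of input, which is an assumption, not the 0.33 API), and `SliceReader` and `drain` are illustrative names only.

```typescript
// Hypothetical stand-in for a Deno-style Reader; `null` marks end of input here.
interface Reader {
  read(p: Uint8Array): Promise<number | null>;
}

// In-memory reader that hands out at most `maxChunk` bytes per call,
// imitating a source that may return fewer bytes than requested.
class SliceReader implements Reader {
  private data: Uint8Array;
  private maxChunk: number;
  private offset = 0;

  constructor(data: Uint8Array, maxChunk: number) {
    this.data = data;
    this.maxChunk = maxChunk;
  }

  async read(p: Uint8Array): Promise<number | null> {
    if (this.offset >= this.data.length) return null;
    const n = Math.min(p.length, this.maxChunk, this.data.length - this.offset);
    p.set(this.data.subarray(this.offset, this.offset + n));
    this.offset += n;
    return n;
  }
}

// Pull data through one reusable buffer; the caller decides when to ask for more.
async function drain(
  r: Reader,
  bufSize: number,
): Promise<{ total: number; shortReads: number }> {
  const buf = new Uint8Array(bufSize);
  let total = 0;
  let shortReads = 0;
  while (true) {
    const n = await r.read(buf);
    if (n === null) break;
    total += n;
    if (n < buf.length) shortReads++; // the reader delivered less than requested
  }
  return { total, shortReads };
}
```

Nothing is read until the caller asks for it, so memory use stays bounded by the buffer size regardless of how large the source is.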

But to give an example of your code above:

import { BufReader } from "https://deno.land/[email protected]/io/mod.ts";
import { TextProtoReader } from "https://deno.land/[email protected]/textproto/mod.ts";
import { parse } from "https://deno.land/[email protected]/flags/mod.ts";
import { basename } from "https://deno.land/[email protected]/path/mod.ts";

export async function read(r: Deno.Reader) {
  const reader = new TextProtoReader(BufReader.create(r));
  console.log("Reading data...");

  let lineCount = 0;
  while (true) {
    let line = await reader.readLine();
    if (line === Deno.EOF) break;
    // do something with `line`
    lineCount += 1;
  }

  console.log(`${lineCount} lines read.`);
}

if (import.meta.main) {
  const args = parse(Deno.args, {
    boolean: ["h"],
    alias: {
      h: ["help"]
    }
  });

  if (args.h) {
    printUsage();
    Deno.exit(0);
  }

  const [filename] = args._;
  if (!filename) {
    printUsage();
    Deno.exit(1);
  }

  const file = filename === "-" ? Deno.stdin : await Deno.open(filename);
  await read(file);
  file.close();

  function printUsage() {
    console.error(
      `Usage: deno --allow-read ${basename(import.meta.url)} <filename>`
    );
  }
}

The thing to note is that I'm using TextProtoReader as a convenience because it has logic for reading lines when the underlying BufReader's buffer is full (you can check out the source; it's pretty straightforward).
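As a rough illustration of that logic (not the actual TextProtoReader source), the sketch below joins line fragments that a fixed-size buffer had to cut apart. The `LineFragment` shape is modelled on the `{ line, more }` result of `BufReader.readLine()`; `fragmentLines` and `joinFragments` are hypothetical helpers written for this example.

```typescript
// Shape modelled on the { line, more } result of BufReader.readLine().
interface LineFragment {
  line: Uint8Array;
  more: boolean; // true when the line continues in the next fragment
}

function concat(a: Uint8Array, b: Uint8Array): Uint8Array {
  const out = new Uint8Array(a.length + b.length);
  out.set(a, 0);
  out.set(b, a.length);
  return out;
}

// Cut every line longer than `bufSize` bytes into fragments,
// imitating what a full buffer forces a line reader to do.
function fragmentLines(lines: string[], bufSize: number): LineFragment[] {
  const enc = new TextEncoder();
  const out: LineFragment[] = [];
  for (const l of lines) {
    const bytes = enc.encode(l);
    let start = 0;
    do {
      const end = Math.min(start + bufSize, bytes.length);
      out.push({ line: bytes.slice(start, end), more: end < bytes.length });
      start = end;
    } while (start < bytes.length);
  }
  return out;
}

// Accumulate fragments until `more` is false, then emit one complete line.
function joinFragments(frags: LineFragment[]): string[] {
  const dec = new TextDecoder();
  const lines: string[] = [];
  let pending: Uint8Array = new Uint8Array(0);
  for (const f of frags) {
    pending = concat(pending, f.line);
    if (!f.more) {
      lines.push(dec.decode(pending));
      pending = new Uint8Array(0);
    }
  }
  return lines;
}
```

One consequence worth checking in your own code: a loop that counts raw `readLine()` results counts fragments, so its total would exceed the real line count whenever a line is longer than the buffer.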

I tried three approaches to read a 307 MB test file:

  • Approach 1: (failed for the big file; only worked for a medium-sized file with 128,457 lines)
import { BufReader } from "https://deno.land/std/io/bufio.ts";

export async function stream_file(filename: string) {
  const file = await Deno.open(filename);
  const bufReader = new BufReader(file);
  console.log("Reading data...");
  let line: string | any;
  let lineCount: number = 0;
  while ((line = await bufReader.readString("\n")) != Deno.EOF) {
    lineCount++;
    // do something with `line`.
  }
  file.close();
  console.log(`${lineCount} lines read.`);
}
  • Approach 2:

    • CPU utilization: 18%

    • time to read data: 1 min 24 s

import { BufReader } from "https://deno.land/[email protected]/io/mod.ts";
import { TextProtoReader } from "https://deno.land/[email protected]/textproto/mod.ts";

export async function textProtoReader(filename:string) {
  const r: Deno.Reader = await Deno.open(filename)
  const reader = new TextProtoReader(BufReader.create(r));
  console.log("Reading data...");

  let lineCount = 0;
  while (true) {
    let line = await reader.readLine();
    if (line === Deno.EOF) break;
    // do something with `line`
    lineCount += 1;
  }

  console.log(`${lineCount} lines read.`);
}
  • Approach 3:

    • CPU: 23%

    • time: 27 s

import { BufReader } from "https://deno.land/std/io/bufio.ts";

export async function readLine(filename: string) {
  const file = await Deno.open(filename);
  const bufReader = new BufReader(file);
  console.log("Reading data...");
  let line: string | any;
  let lineCount: number = 0;
  while ((line = await bufReader.readLine()) != Deno.EOF) {
    lineCount++;
    // do something with `line`.
  }
  file.close();
  console.log(`${lineCount} lines read.`);
}

Rust itself does it in:

  • CPU: 19%
  • time: 13 s
use std::fs::File;
use std::io::{self, BufRead};
use std::path::Path;
use std::io::{stdin, stdout, Read, Write};
use std::time::{Instant};

fn main() {
    let start = Instant::now();
    let mut counter = 0;
    // The file ./enwik9 must exist in the current directory, or nothing is counted.
    if let Ok(lines) = read_lines("./enwik9") {
        // Consume the iterator; each item is an io::Result<String>.
        for _line in lines {
            counter += 1;
        }
        println!("{}", counter);
    }
    let duration = start.elapsed();
    println!("Time elapsed: {:?}", duration);
    pause();
}


fn read_lines<P>(filename: P) -> io::Result<io::Lines<io::BufReader<File>>>
where P: AsRef<Path>, {
    let file = File::open(filename)?;
    Ok(io::BufReader::new(file).lines())
}

fn pause() {
  let mut stdout = stdout();
  stdout.write(b"Press Enter to continue...").unwrap();
  stdout.flush().unwrap();
  stdin().read(&mut [0]).unwrap();
}

Node does it in:

  • CPU: 23%
  • time: 10 s
const fs = require("fs");
const readline = require("readline");

async function processLineByLine() {
  console.log(Date());
  const fileStream = fs.createReadStream("./enwik9");

  const rl = readline.createInterface({
    input: fileStream,
    crlfDelay: Infinity
  });

  let counter = 0;
  for await (const line of rl) {
    counter++;
  }
  console.log(Date());
  return counter;
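}

// Run the function and report the count.
processLineByLine().then((count) => console.log(`${count} lines read.`));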

Can anybody explain why the timings differ?
I guess the best way to read a big file in Deno is approach 3.
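For comparison with the Node `for await` loop, here is a sketch of the same shape built on plain byte chunks: an async generator that yields lines from any `AsyncIterable<Uint8Array>`. It uses only `TextDecoder` and is not a Deno std API; `lines`, `chunked`, and `countLines` are hypothetical names, and `chunked` is an in-memory source used purely for demonstration.

```typescript
// Yield decoded lines from a stream of byte chunks; chunk boundaries may fall mid-line.
async function* lines(chunks: AsyncIterable<Uint8Array>): AsyncGenerator<string> {
  const decoder = new TextDecoder();
  let pending = "";
  for await (const chunk of chunks) {
    // stream: true keeps a multi-byte character split across chunks intact
    pending += decoder.decode(chunk, { stream: true });
    let start = 0;
    let nl = pending.indexOf("\n", start);
    while (nl >= 0) {
      yield pending.slice(start, nl);
      start = nl + 1;
      nl = pending.indexOf("\n", start);
    }
    pending = pending.slice(start); // keep the unfinished tail for the next chunk
  }
  pending += decoder.decode(); // flush any bytes held back by the decoder
  if (pending.length > 0) yield pending; // last line without a trailing newline
}

// Hypothetical in-memory chunk source, for demonstration only.
async function* chunked(text: string, size: number): AsyncGenerator<Uint8Array> {
  const bytes = new TextEncoder().encode(text);
  for (let i = 0; i < bytes.length; i += size) {
    yield bytes.subarray(i, i + size);
  }
}

async function countLines(chunks: AsyncIterable<Uint8Array>): Promise<number> {
  let count = 0;
  for await (const _line of lines(chunks)) count++;
  return count;
}
```

This sketch splits on `\n` only and does not strip a trailing `\r`. It says nothing about where the measured time differences come from; it only shows that a line iterator can be layered on chunked reads.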

Side note, feels like something we need to add to the benchmarks.
